path: root/kernel/rcu/rcu.h
Age  Commit message  Author
2025-05-16  Merge branches 'rcu/misc-for-6.16', 'rcu/seq-counters-for-6.16' and 'rcu/torture-for-6.16' into rcu/for-next  Joel Fernandes
2025-05-16  rcutorture: Perform more frequent testing of ->gpwrap  Joel Fernandes
Currently, ->gpwrap is not tested (at all, per my testing) because setting it requires a large delta between a CPU's rdp->gp_seq and its node's rnp->gp_seq. As a result, the case where ->gpwrap is set gets no testing.

By default, this patch adds 5 minutes of testing with ->gpwrap forced, by lowering the delta between rdp->gp_seq and rnp->gp_seq to just 8 GPs. All of this is configurable, including the active time for the setting and the length of a full testing cycle.

By default, the first 25 minutes of a test keep the current _default_ behavior (a delta of ULONG_MAX / 4). Then, for 5 minutes, we switch to the smaller delta, causing 1-2 wraps in 5 minutes. I believe this is reasonable since it at least adds a little bit of testing for use cases where ->gpwrap is set.

[ Apply fix for Dan Carpenter's bug report on init path cleanup. ]
[ Apply kernel doc warning fix from Akira Yokosawa. ]

Tested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
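The default schedule described in this commit (25 minutes at the normal delta, then 5 minutes at an 8-GP delta) can be sketched in user-space C. This is only an illustration: the function and parameter names are hypothetical and are not taken from the patch, and the conversion from GPs to counter units assumes two low-order state bits per sequence counter.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: true while the shortened-delta window is active. */
static bool gpwrap_lag_active(unsigned long elapsed_mins,
			      unsigned long cycle_mins,
			      unsigned long active_mins)
{
	/* Degenerate settings: the window is either always on or always off. */
	if (cycle_mins == 0 || active_mins >= cycle_mins)
		return active_mins != 0;
	/* The active window occupies the tail end of each cycle. */
	return (elapsed_mins % cycle_mins) >= (cycle_mins - active_mins);
}

/* Hypothetical helper: delta (in counter units) used for the ->gpwrap check. */
static unsigned long gpwrap_delta(bool lag_active, unsigned long lag_gps)
{
	/* Assumption: two state bits per counter, so one GP is 4 counts. */
	return lag_active ? (lag_gps << 2) : (ULONG_MAX / 4);
}
```

With a 30-minute cycle and a 5-minute active window, minutes 0 through 24 use the ULONG_MAX / 4 delta and minutes 25 through 29 use the 8-GP delta, after which the cycle repeats.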
2025-04-08  rcu: Comment on the extraneous delta test on rcu_seq_done_exact()  Frederic Weisbecker
The numbers used in rcu_seq_done_exact() lack some explanation behind their magic. Especially after the commit:

    85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")

which reported a subtle issue where a new GP sequence snapshot was taken on the root node state while a grace period had already been started and reflected on the global state sequence but not yet on the root node sequence, making a polling user wait on a wrong, already started grace period that would ignore freshly onlined CPUs.

The fix involved taking the snapshot on the global state sequence and waiting on the root node sequence. And since a grace period is first started on the global state and only afterward reflected on the root node, a snapshot taken on the global state sequence might be two full grace periods ahead of the root node, as in the following example:

    rnp->gp_seq = rcu_state.gp_seq = 0

    CPU 0                                    CPU 1
    -----                                    -----
    // rcu_state.gp_seq = 1
    rcu_seq_start(&rcu_state.gp_seq)
                                             // snap = 8
                                             snap = rcu_seq_snap(&rcu_state.gp_seq)
                                             // Two full GP differences
                                             rcu_seq_done_exact(&rnp->gp_seq, snap)
    // rnp->gp_seq = 1
    WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

Add a comment about those expectations and to clarify the magic within the relevant function.

Note that the issue arises mainly with the use of rcu_seq_done_exact(), which has a much tighter guardband (of 2 GPs) to ensure the false-negative window of the API during wraparound is limited to just 2 GPs. rcu_seq_done() does not have such strict requirements, however its large false-negative window of ULONG_MAX/2 is not ideal for the polling API. However, this also means care is needed to ensure the guardband is as large as needed to avoid the example scenario described above, which a warning added in an earlier patch checks for.

[ Comment wordsmithing by Joel ]

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
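The arithmetic behind "snap = 8" and "two full GP differences" can be checked with a small user-space model. This is a sketch under stated assumptions, not the kernel source: the helper names are hypothetical, the counters are plain unsigned longs with no memory ordering, and the layout assumes two low-order state bits per counter.

```c
#define SEQ_CTR_SHIFT	2
#define SEQ_STATE_MASK	((1UL << SEQ_CTR_SHIFT) - 1)

/* Mark a grace period as started: the low-order state bits become nonzero. */
static void seq_start(unsigned long *sp)
{
	*sp += 1;
}

/* Smallest counter value that guarantees a full GP elapsed after the snapshot. */
static unsigned long seq_snap(unsigned long cur)
{
	return (cur + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

/* Distance between a snapshot and a node's counter, in whole grace periods. */
static unsigned long gps_ahead(unsigned long snap, unsigned long node_seq)
{
	return (snap - node_seq) >> SEQ_CTR_SHIFT;
}
```

Starting from 0, one seq_start() yields 1, seq_snap(1) yields 8, and gps_ahead(8, 0) yields 2, which matches the example in the commit message. A snapshot taken before the start, seq_snap(0), yields 4 and is only one GP ahead.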
2025-04-08  rcu: Replace magic number with meaningful constant in rcu_seq_done_exact()  Joel Fernandes
The rcu_seq_done_exact() function checks if a grace period has completed by comparing sequence numbers. It includes a guard band to handle sequence number wraparound, which was previously expressed using the magic number calculation '3 * RCU_SEQ_STATE_MASK + 1'.

This magic number is not immediately obvious in terms of what it represents.

Instead, the reason we need this tiny guardband is the lag between the setting of rcu_state.gp_seq_polled and root rnp's gp_seq in rcu_gp_init().

This guardband needs to be at least 2 GPs worth of counts, to avoid recognizing the newly started GP as completed immediately, due to the following sequence, which arises from the delay between the update of rcu_state.gp_seq_polled and root rnp's gp_seq:

    rnp->gp_seq = rcu_state.gp_seq = 0

    CPU 0                                    CPU 1
    -----                                    -----
    // rcu_state.gp_seq = 1
    rcu_seq_start(&rcu_state.gp_seq)
                                             // snap = 8
                                             snap = rcu_seq_snap(&rcu_state.gp_seq)
                                             // Two full GP differences
                                             rcu_seq_done_exact(&rnp->gp_seq, snap)
    // rnp->gp_seq = 1
    WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);

This can happen because get_state_synchronize_rcu_full() samples rcu_state.gp_seq_polled, whereas poll_state_synchronize_rcu_full() samples the root rnp's gp_seq. The delay between the update of the 2 counters occurs in rcu_gp_init(), during which the counters briefly go out of sync.

Make the guardband explicitly 2 GPs. This improves code readability and maintainability by making the intent clearer as well.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
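Why the guardband must be at least 2 GPs can be illustrated with a user-space model of the check. This is a sketch, not the kernel implementation: the names are hypothetical, the guardband is passed as a parameter purely so that different widths can be compared, and one GP is assumed to be 4 counts.

```c
#include <limits.h>
#include <stdbool.h>

#define SEQ_CTR_SHIFT	2
#define SEQ_GP		(1UL << SEQ_CTR_SHIFT)	/* counts per grace period */

/* Wrap-tolerant "a >= b" on unsigned long counters. */
static bool seq_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* Wrap-tolerant "a < b" on unsigned long counters. */
static bool seq_cmp_lt(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 < a - b;
}

/*
 * "Done" if cur has reached snap, or if cur is more than guard_gps
 * grace periods behind snap, which is taken to mean the counter wrapped.
 */
static bool seq_done_exact(unsigned long cur, unsigned long snap,
			   unsigned long guard_gps)
{
	return seq_cmp_ge(cur, snap) ||
	       seq_cmp_lt(cur, snap - guard_gps * SEQ_GP);
}
```

With the values from the example (root counter 0, snap 8), a 2-GP guardband reports "not done", while a 1-GP guardband would treat the two-GP gap as a wrap and wrongly report "done".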
2025-03-04  Merge branches 'docs.2025.02.04a', 'lazypreempt.2025.03.04a', 'misc.2025.03.04a', 'srcu.2025.02.05a' and 'torture.2025.02.05a'  Boqun Feng
2025-03-04  rcu: Fix get_state_synchronize_rcu_full() GP-start detection  Paul E. McKenney
The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() functions use the root rcu_node structure's ->gp_seq field to detect the beginnings and ends of grace periods, respectively. This choice is necessary for the poll_state_synchronize_rcu_full() function because (give or take counter wrap), the following sequence is guaranteed not to trigger:

    get_state_synchronize_rcu_full(&rgos);
    synchronize_rcu();
    WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));

The RCU callbacks that awaken synchronize_rcu() instances are guaranteed not to be invoked before the root rcu_node structure's ->gp_seq field is updated to indicate the end of the grace period. However, these callbacks might start being invoked immediately thereafter, in particular, before rcu_state.gp_seq has been updated. Therefore, poll_state_synchronize_rcu_full() must refer to the root rcu_node structure's ->gp_seq field. Because this field is updated under this structure's ->lock, any code following a call to poll_state_synchronize_rcu_full() will be fully ordered after the full grace-period computation, as is required by RCU's memory-ordering semantics.

By symmetry, the get_state_synchronize_rcu_full() function should also use this same root rcu_node structure's ->gp_seq field. But it turns out that symmetry is profoundly (though extremely infrequently) destructive in this case. To see this, consider the following sequence of events:

1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq accordingly.

2. As its first step of grace-period initialization, CPU 0 examines the current CPU hotplug state and decides that it need not wait for CPU 1, which is currently offline.

3. CPU 1 comes online, and updates its state. But this does not affect the current grace period, but rather the one after that. After all, CPU 1 was offline when the current grace period started, so all pre-existing RCU readers on CPU 1 must have completed or been preempted before it last went offline. The current grace period therefore has nothing it needs to wait for on CPU 1.

4. CPU 1 switches to an rcutorture kthread which is running rcutorture's rcu_torture_reader() function, which starts a new RCU reader.

5. CPU 2 is running rcutorture's rcu_torture_writer() function and collects a new polled grace-period "cookie" using get_state_synchronize_rcu_full(). Because the newly started grace period has not completed initialization, the root rcu_node structure's ->gp_seq field has not yet been updated to indicate that this new grace period has already started.

   This cookie is therefore set up for the end of the current grace period (rather than the end of the following grace period).

6. CPU 0 finishes grace-period initialization.

7. If CPU 1’s rcutorture reader is preempted, it will be added to the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer will not be updated. Thus, this grace period will not wait on it. Which is only fair, given that the CPU did not come online until after the grace period officially started.

8. CPUs 0 and 2 then detect the new grace period and then report a quiescent state to the RCU core.

9. Because CPU 1 was offline at the start of the current grace period, CPUs 0 and 2 are the only CPUs that this grace period needs to wait on. So the grace period ends and post-grace-period cleanup starts. In particular, the root rcu_node structure's ->gp_seq field is updated to indicate that this grace period has now ended.

10. CPU 2 continues running rcu_torture_writer() and sees that, from the viewpoint of the root rcu_node structure consulted by the poll_state_synchronize_rcu_full() function, the grace period has ended. It therefore updates state accordingly.

11. CPU 1 is still running the same RCU reader, which notices this update and thus complains about the too-short grace period.

The fix is for the get_state_synchronize_rcu_full() function to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field. With this change in place, if step 5's cookie indicates that the grace period has not yet started, then any prior code executed by CPU 2 must have happened before CPU 1 came online. This will in turn prevent CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait, thus preventing CPU 1 from being subjected to a too-short grace period.

This commit therefore makes this change. Note that there is no change to the poll_state_synchronize_rcu_full() function, which as noted above, must continue to use the root rcu_node structure's ->gp_seq field. This is of course an asymmetry between these two functions, but is an asymmetry that is absolutely required for correct operation. It is a common human tendency to greatly value symmetry, and sometimes symmetry is a wonderful thing. Other times, symmetry results in poor performance. But in this case, symmetry is just plain wrong.

Nevertheless, the asymmetry does require an additional adjustment. It is possible for get_state_synchronize_rcu_full() to see a given grace period as having started, but for an immediately following poll_state_synchronize_rcu_full() to see it as having not yet started. Given the current rcu_seq_done_exact() implementation, this will result in a false-positive indication that the grace period is done from poll_state_synchronize_rcu_full(). This is dealt with by making rcu_seq_done_exact() reach back three grace periods rather than just two of them.

However, simply changing get_state_synchronize_rcu_full() function to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field results in a theoretical bug in kernels booted with rcutree.rcu_normal_wake_from_gp=1 due to the following sequence of events:

o The rcu_gp_init() function invokes rcu_seq_start() to officially start a new grace period.

o A new RCU reader begins, referencing X from some RCU-protected list. The new grace period is not obligated to wait for this reader.

o An updater removes X, then calls synchronize_rcu(), which queues a wait element.

o The grace period ends, awakening the updater, which frees X while the reader is still referencing it.

The reason that this is theoretical is that although the grace period has officially started, none of the CPUs are officially aware of this, and thus will have to assume that the RCU reader pre-dated the start of the grace period. Detailed explanation can be found at [2] and [3].

Except for kernels built with CONFIG_PROVE_RCU=y, which use the polled grace-period APIs, which can and do complain bitterly when this sequence of events occurs. Not only that, there might be some future RCU grace-period mechanism that pulls this sequence of events from theory into practice. This commit therefore also pulls the call to rcu_sr_normal_gp_init() to precede that to rcu_seq_start().

Although this fixes commit 91a967fd6934 ("rcu: Add full-sized polling for get_completed*() and poll_state*()"), it is not clear that it is worth backporting this commit. First, it took me many weeks to convince rcutorture to reproduce this more frequently than once per year. Second, this cannot be reproduced at all without frequent CPU-hotplug operations, as in waiting all of 50 milliseconds from the end of the previous operation until starting the next one. Third, the TREE03.boot settings cause multi-millisecond delays during RCU grace-period initialization, which greatly increase the probability of the above sequence of events. (Don't do this in production workloads!) Fourth, the TREE03 rcutorture scenario was modified to use four-CPU guest OSes, to have a single-rcu_node combining tree, no testing of RCU priority boosting, and no random preemption, and these modifications were necessary to reproduce this issue in a reasonable timeframe. Fifth, extremely heavy use of get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full() is required to reproduce this, and as of v6.12, only kfree_rcu() uses it, and even then not particularly heavily.

[boqun: Apply the fix [1], and add the comment before the moved rcu_sr_normal_gp_init(). Additional links are added for explanation.]

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Link: https://lore.kernel.org/rcu/d90bd6d9-d15c-4b9b-8a69-95336e74e8f4@paulmck-laptop/ [1]
Link: https://lore.kernel.org/rcu/20250303001507.GA3994772@joelnvbox/ [2]
Link: https://lore.kernel.org/rcu/Z8bcUsZ9IpRi1QoP@pc636/ [3]
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
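The effect of the fix in the commit above can be seen in a toy user-space model with two counters, one standing in for rcu_state.gp_seq and one for the root rcu_node structure's ->gp_seq. Everything here is a simplified assumption for illustration: the struct and function names are hypothetical, the real cookie also carries expedited state, and there is no locking or memory ordering.

```c
#define SEQ_STATE_MASK	3UL	/* assumes two low-order state bits */

struct gp_model {
	unsigned long state_seq;	/* stands in for rcu_state.gp_seq */
	unsigned long root_seq;		/* stands in for the root ->gp_seq */
};

static unsigned long seq_snap(unsigned long cur)
{
	return (cur + 2 * SEQ_STATE_MASK + 1) & ~SEQ_STATE_MASK;
}

/* Pre-fix behavior: the cookie is derived from the root node's counter. */
static unsigned long cookie_from_root(const struct gp_model *m)
{
	return seq_snap(m->root_seq);
}

/* Post-fix behavior: the cookie is derived from the global counter. */
static unsigned long cookie_from_state(const struct gp_model *m)
{
	return seq_snap(m->state_seq);
}

/* Value the root counter will hold once the in-flight GP has ended. */
static unsigned long root_seq_after_current_gp(const struct gp_model *m)
{
	return (m->state_seq | SEQ_STATE_MASK) + 1;
}
```

In the window of step 5 (global counter 1, root counter still 0), the root-derived cookie is 4, which is exactly the value the root counter reaches when the already-started grace period ends. The global-derived cookie is 8, so the poller has to wait for the following grace period as well.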
2025-02-05  rcutorture: Make cur_ops->format_gp_seqs take buffer length  Paul E. McKenney
The Tree and Tiny implementations of rcutorture_format_gp_seqs() use hard-coded constants for the length of the buffer that they format into. This is of course an accident waiting to happen, so this commit therefore makes them take a length argument. The rcutorture calling code uses ARRAY_SIZE() to safely compute this new argument. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
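The pattern this commit describes, a formatter that takes the buffer length and a caller that derives it from the array itself, looks roughly like the following user-space sketch. The function name, its parameters and the output format are hypothetical and do not reproduce the kernel's rcutorture_format_gp_seqs().

```c
#include <stdio.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Hypothetical formatter: writes at most len bytes (including the
 * terminating NUL) and returns the length the full string would need.
 */
static int format_gp_seqs(unsigned long norm, unsigned long exp,
			  char *buf, size_t len)
{
	return snprintf(buf, len, "g%04lx:e%06lx",
			norm & 0xffffUL, exp & 0xffffffUL);
}
```

A caller would declare `char buf[32];` and pass `ARRAY_SIZE(buf)`, so that resizing the array later cannot silently overflow it. If the buffer is too small, snprintf() truncates the output and the return value exceeds the buffer size.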
2025-02-05  rcutorture: Expand failure/close-call grace-period output  Paul E. McKenney
With only eight bits per grace-period sequence number, wrap can happen in 64 grace periods. This commit therefore increases this to sixteen bits for normal grace-period sequence numbers and the combined short-form polling sequence numbers, thus deferring wrap for at least 16,384 grace periods. Because expedited grace periods go faster, expand these to 24 bits, deferring wrap for at least 4,194,304 expedited grace periods. These longer wrap times make it easier to correlate these numbers to trace-event output.

Note that the low-order two bits are reserved for intra-grace-period state, hence the above wrap numbers being a factor of four smaller than you might expect.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
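The wrap figures quoted in this commit follow directly from reserving two low-order bits for state. A minimal sketch of the arithmetic, with a hypothetical helper name:

```c
#define SEQ_STATE_BITS	2U	/* low-order bits reserved for intra-GP state */

/*
 * Grace periods that fit before a counter truncated to nbits wraps.
 * Assumes SEQ_STATE_BITS < nbits < the width of unsigned long.
 */
static unsigned long gps_before_wrap(unsigned int nbits)
{
	return 1UL << (nbits - SEQ_STATE_BITS);
}
```

For 8, 16 and 24 bits this gives 64, 16,384 and 4,194,304 grace periods respectively, a factor of four below 2^8, 2^16 and 2^24.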
2025-02-05  rcutorture: Include grace-period sequence numbers in failure/close-call  Paul E. McKenney
This commit includes the grace-period sequence numbers at the beginning and end of each segment in the "Failure/close-call rcutorture reader segments" list. These are in hexadecimal, and only the bottom byte. Currently, only RCU is supported, with its three sequence numbers (normal, expedited, and polled). Note that if all the grace-period sequence numbers remain the same across a given reader segment, only one copy of the number will be printed. Of course, if there is a change, both sets of values will be printed. Because the overhead of collecting this information can suppress heisenbugs, this information is collected and printed only in kernels built with CONFIG_RCU_TORTURE_TEST_LOG_GP=y. [ paulmck: Apply Nathan Chancellor feedback for IS_ENABLED(). ] [ paulmck: Apply feedback from kernel test robot. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-02-05  srcu: Make Tiny SRCU able to operate in preemptible kernels  Paul E. McKenney
Given that SRCU read-side critical sections are not just preemptible, but also allow general blocking, there is not much reason to restrict Tiny SRCU to non-preemptible kernels. This commit therefore removes Tiny SRCU dependencies on non-preemptibility, primarily surrounding its interaction with rcutorture and early boot.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-09-09  Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', 'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a  Neeraj Upadhyay
2024-08-15  rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs()  Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING. Reflect that change in the related helpers.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-14  rcu: Provide rcu_barrier_cb_is_done() to check rcu_barrier() CBs  Paul E. McKenney
This commit provides a rcu_barrier_cb_is_done() function that returns true if the *rcu_barrier*() callback passed in is done. This will be used when printing grace-period debugging information. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-12  srcu: faster gp seq wrap-around  JP Kobryn
Using a higher value for the initial gp sequence counters allows for wrapping to occur faster. It can help with surfacing any issues that may be happening as a result of the wrap around. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
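The idea of starting a sequence counter close to its maximum, so that wraparound is exercised soon after boot, can be sketched as follows. The helper names and the choice of how many grace periods to leave before the wrap are assumptions made for illustration, and the layout assumes two low-order state bits.

```c
#define SEQ_CTR_SHIFT	2

/* Initial counter value that wraps to zero after gps_until_wrap GPs. */
static unsigned long seq_initial_val(unsigned long gps_until_wrap)
{
	/* Unsigned arithmetic is modulo 2^N, so 0 - n counts back from the top. */
	return (0UL - gps_until_wrap) << SEQ_CTR_SHIFT;
}

/* Advance the counter by one complete grace period. */
static unsigned long seq_next_gp(unsigned long seq)
{
	return seq + (1UL << SEQ_CTR_SHIFT);
}
```

Starting from seq_initial_val(100), the counter stays nonzero for 99 grace periods and wraps to 0 on the 100th, so any wrap-handling bug shows up within a short test run instead of after a full traversal of the counter space.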
2024-04-16  rcutorture: Make rcutorture support print rcu-tasks gp state  Zqiang
This commit makes the rcu-tasks-related rcutorture tests print the rcu-tasks gp state when a writer stall occurs or at the end of the rcutorture test, and introduces an rcu_ops->get_gp_data() operation to simplify the acquisition of gp state for the different types of rcutorture tests.

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-02-26  Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', 'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26a  Boqun Feng
2024-02-25  rcu-tasks: Initialize callback lists at rcu_init() time  Paul E. McKenney
In order for RCU Tasks to reliably maintain per-CPU lists of exiting tasks, those lists must be initialized before it is possible for tasks to exit, especially given that the boot CPU is not necessarily CPU 0 (an example being, powerpc kexec() kernels). And at the time that rcu_init_tasks_generic() is called, a task could potentially exit, unconventional though that sort of thing might be. This commit therefore moves the calls to cblist_init_generic() from functions called from rcu_init_tasks_generic() to a new function named tasks_cblist_init_generic() that is invoked from rcu_init(). This constituted a bug in a commit that never went to mainline, so there is no need for any backporting to -stable. Reported-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14  rcu: Rename jiffies_till_flush to jiffies_lazy_flush  Frederic Weisbecker
The variable name jiffies_till_flush is too generic and therefore:

* It may shadow a global variable
* It doesn't say what it operates on

Make the name more precise, along with the related APIs.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14  rcu/exp: Remove rcu_par_gp_wq  Frederic Weisbecker
TREE04 running on short iterations can produce writer stalls of the following kind:

    ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
    task:rcu_torture_wri state:D stack:14568 pid:83 ppid:2 flags:0x00004000
    Call Trace:
     <TASK>
     __schedule+0x2de/0x850
     ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
     schedule+0x4f/0x90
     synchronize_rcu_expedited+0x430/0x670
     ? __pfx_autoremove_wake_function+0x10/0x10
     ? __pfx_synchronize_rcu_expedited+0x10/0x10
     do_rtws_sync.constprop.0+0xde/0x230
     rcu_torture_writer+0x4b4/0xcd0
     ? __pfx_rcu_torture_writer+0x10/0x10
     kthread+0xc7/0xf0
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x2f/0x50
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1b/0x30
     </TASK>

Waiting for an expedited grace period and polling for an expedited grace period both are operations that internally rely on the same workqueue performing necessary asynchronous work.

However, a dependency chain is involved between those two operations, as depicted below:

    ====== CPU 0 =======                          ====== CPU 1 =======

                                                  synchronize_rcu_expedited()
                                                      exp_funnel_lock()
                                                          mutex_lock(&rcu_state.exp_mutex);
    start_poll_synchronize_rcu_expedited
        queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
                                                      synchronize_rcu_expedited_queue_work()
                                                          queue_work(rcu_gp_wq, &rew->rew_work);
                                                      wait_event() // A, wait for &rew->rew_work completion
                                                      mutex_unlock() // B
    //======> switch to kworker

    sync_rcu_do_polled_gp() {
        synchronize_rcu_expedited()
            exp_funnel_lock()
                mutex_lock(&rcu_state.exp_mutex); // C, wait B
                ....
    } // D

Since workqueues are usually implemented on top of several kworkers handling the queue concurrently, the above situation wouldn't deadlock most of the time because A then doesn't depend on D. But in case of memory stress, a single kworker may end up handling all the works alone, in a serialized way. In that case the above layout becomes a problem because A then waits for D, closing a circular dependency:

    A -> D -> C -> B -> A

This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed synchronize_rcu_expedited() is otherwise implemented on top of a kthread worker while polling still relies on the rcu_gp_wq workqueue, breaking the above circular dependency chain.

Fix this by making expedited grace periods always rely on a kthread worker. The workqueue based implementation is essentially a duplicate anyway now that the per-node initialization is performed by per-node kthread workers.

Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to manage the scheduler policy of these kthread workers.

Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14  rcu/exp: Make parallel exp gp kworker per rcu node  Frederic Weisbecker
When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node initialization is performed in parallel via workqueues (one work per node). However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is performed by a single kworker serializing each node initialization (one work for all nodes). The second part is certainly less scalable and efficient beyond a single leaf node. To improve this, expand this single kworker into per-node kworkers. This new layout is eventually intended to remove the workqueues based implementation since it will essentially now become duplicate code. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2023-12-12  rcu: Restrict access to RCU CPU stall notifiers  Paul E. McKenney
The RCU CPU stall notifiers can be useful for dumping state when tracking down delicate forward-progress bugs, such as those where NUMA effects cause cache lines to be delivered to a given CPU regularly, but always in a state that prevents that CPU from making forward progress. These bugs can be detected by the RCU CPU stall-warning mechanism, but in some cases, the stall-warnings printk()s disrupt the forward-progress bug before any useful state can be obtained.

Unfortunately, the notifier mechanism added by commit 5b404fdabacf ("rcu: Add RCU CPU stall notifier") can make matters worse if used at all carelessly. For example, if the stall warning was caused by a lock not being released, then any attempt to acquire that lock in the notifier will hang. This will prevent not only the notifier from producing any useful output, but it will also prevent the stall-warning message from ever appearing.

This commit therefore hides this new RCU CPU stall notifier mechanism under a new RCU_CPU_STALL_NOTIFIER Kconfig option that depends on both DEBUG_KERNEL and RCU_EXPERT. In addition, the rcupdate.rcu_cpu_stall_notifiers=1 kernel boot parameter must also be specified. The RCU_CPU_STALL_NOTIFIER Kconfig option's help text contains a warning and explains the dangers of careless use, recommending lockless notifier code. In addition, a WARN() is triggered each time that an attempt is made to register a stall-warning notifier in kernels built with CONFIG_RCU_CPU_STALL_NOTIFIER=y.

This combination of measures will keep use of this mechanism confined to debug kernels and away from routine deployments.

[ paulmck: Apply Dan Carpenter feedback. ]

Fixes: 5b404fdabacf ("rcu: Add RCU CPU stall notifier")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
2023-11-01  rcu: Introduce rcu_cpu_online()  Frederic Weisbecker
Export the RCU point of view as to when a CPU is considered offline (ie: when does RCU consider that a CPU is sufficiently down in the hotplug process to not feature any possible read side). This will be used by RCU-tasks whose vision of an offline CPU should reasonably match the one of RCU core. Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-23  Merge branches 'rcu/torture', 'rcu/fixes', 'rcu/docs', 'rcu/refscale', 'rcu/tasks' and 'rcu/stall' into rcu/next  Frederic Weisbecker
rcu/torture: RCU torture, locktorture and generic torture infrastructure
rcu/fixes: Generic and misc fixes
rcu/docs: RCU documentation updates
rcu/refscale: RCU reference scalability test updates
rcu/tasks: RCU tasks updates
rcu/stall: Stall detection updates
2023-09-24  torture: Move rcutorture_sched_setaffinity() out of rcutorture  Paul E. McKenney
The rcutorture_sched_setaffinity() function is needed by locktorture, so move its declaration from rcu.h to torture.h and rename it to the more generic torture_sched_setaffinity() name. Please note that use of this function is still restricted to torture tests, and of those, currently only rcutorture and locktorture. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-13  rcu: Dump memory object info if callback function is invalid  Zhen Lei
When a structure containing an RCU callback rhp is (incorrectly) freed and reallocated after rhp is passed to call_rcu(), it is not unusual for rhp->func to be set to NULL. This defeats the debugging prints used by __call_rcu_common() in kernels built with CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, which expect to identify the offending code using the identity of this function.

And in kernels built without CONFIG_DEBUG_OBJECTS_RCU_HEAD=y, things are even worse, as can be seen from this splat:

    Unable to handle kernel NULL pointer dereference at virtual address 0
    ... ...
    PC is at 0x0
    LR is at rcu_do_batch+0x1c0/0x3b8
    ... ...
    (rcu_do_batch) from (rcu_core+0x1d4/0x284)
    (rcu_core) from (__do_softirq+0x24c/0x344)
    (__do_softirq) from (__irq_exit_rcu+0x64/0x108)
    (__irq_exit_rcu) from (irq_exit+0x8/0x10)
    (irq_exit) from (__handle_domain_irq+0x74/0x9c)
    (__handle_domain_irq) from (gic_handle_irq+0x8c/0x98)
    (gic_handle_irq) from (__irq_svc+0x5c/0x94)
    (__irq_svc) from (arch_cpu_idle+0x20/0x3c)
    (arch_cpu_idle) from (default_idle_call+0x4c/0x78)
    (default_idle_call) from (do_idle+0xf8/0x150)
    (do_idle) from (cpu_startup_entry+0x18/0x20)
    (cpu_startup_entry) from (0xc01530)

This commit therefore adds calls to mem_dump_obj(rhp) to output some information, for example:

    slab kmalloc-256 start ffff410c45019900 pointer offset 0 size 256

This provides the rough size of the memory block and the offset of the rcu_head structure, which provides at least a few clues to help locate the problem. If the problem is reproducible, additional slab debugging can be enabled, for example, CONFIG_DEBUG_SLAB=y, which can provide significantly more information.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-09-11  rcu: Add RCU CPU stall notifier  Paul E. McKenney
It is sometimes helpful to have a way for the subsystem causing the stall to dump its state when an RCU CPU stall occurs. This commit therefore bases rcu_stall_chain_notifier_register() and rcu_stall_chain_notifier_unregister() on atomic notifiers in order to provide this functionality. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-08-29Merge tag 'net-next-6.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Increase size limits for to-be-sent skb frag allocations. This allows tun, tap devices and packet sockets to better cope with large writes operations - Store netdevs in an xarray, to simplify iterating over netdevs - Refactor nexthop selection for multipath routes - Improve sched class lifetime handling - Add backup nexthop ID support for bridge - Implement drop reasons support in openvswitch - Several data races annotations and fixes - Constify the sk parameter of routing functions - Prepend kernel version to netconsole message Protocols: - Implement support for TCP probing the peer being under memory pressure - Remove hard coded limitation on IPv6 specific info placement inside the socket struct - Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per socket scaling factor - Scaling-up the IPv6 expired route GC via a separated list of expiring routes - In-kernel support for the TLS alert protocol - Better support for UDP reuseport with connected sockets - Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR header size - Get rid of additional ancillary per MPTCP connection struct socket - Implement support for BPF-based MPTCP packet schedulers - Format MPTCP subtests selftests results in TAP - Several new SMC 2.1 features including unique experimental options, max connections per lgr negotiation, max links per lgr negotiation BPF: - Multi-buffer support in AF_XDP - Add multi uprobe BPF links for attaching multiple uprobes and usdt probes, which is significantly faster and saves extra fds - Implement an fd-based tc BPF attach API (TCX) and BPF link support on top of it - Add SO_REUSEPORT support for TC bpf_sk_assign - Support new instructions from cpu v4 to simplify the generated code and feature completeness, for x86, arm64, riscv64 - Support defragmenting IPv(4|6) packets in BPF - Teach verifier actual bounds of 
bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling - Introduce bpf map element count and enable it for all program types - Add a BPF hook in sys_socket() to change the protocol ID from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy - Introduce bpf_me_mcache_free_rcu() and fix OOM under stress - Add uprobe support for the bpf_get_func_ip helper - Check skb ownership against full socket - Support for up to 12 arguments in BPF trampoline - Extend link_info for kprobe_multi and perf_event links Netfilter: - Speed-up process exit by aborting ruleset validation if a fatal signal is pending - Allow NLA_POLICY_MASK to be used with BE16/BE32 types Driver API: - Page pool optimizations, to improve data locality and cache usage - Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need for raw ioctl() handling in drivers - Simplify genetlink dump operations (doit/dumpit) providing them the common information already populated in struct genl_info - Extend and use the yaml devlink specs to [re]generate the split ops - Introduce devlink selective dumps, to allow SF filtering SF based on handle and other attributes - Add yaml netlink spec for netlink-raw families, allow route, link and address related queries via the ynl tool - Remove phylink legacy mode support - Support offload LED blinking to phy - Add devlink port function attributes for IPsec New hardware / drivers: - Ethernet: - Broadcom ASP 2.0 (72165) ethernet controller - MediaTek MT7988 SoC - Texas Instruments AM654 SoC - Texas Instruments IEP driver - Atheros qca8081 phy - Marvell 88Q2110 phy - NXP TJA1120 phy - WiFi: - MediaTek mt7981 support - Can: - Kvaser SmartFusion2 PCI Express devices - Allwinner T113 controllers - Texas Instruments tcan4552/4553 chips - Bluetooth: - Intel Gale Peak - Qualcomm WCN3988 and WCN7850 - NXP AW693 and IW624 - Mediatek MT2925 Drivers: - Ethernet NICs: - nVidia/Mellanox: - mlx5: - support UDP encapsulation in packet offload 
mode - IPsec packet offload support in eswitch mode - improve aRFS observability by adding new set of counters - extends MACsec offload support to cover RoCE traffic - dynamic completion EQs - mlx4: - convert to use auxiliary bus instead of custom interface logic - Intel - ice: - implement switchdev bridge offload, even for LAG interfaces - implement SRIOV support for LAG interfaces - igc: - add support for multiple in-flight TX timestamps - Broadcom: - bnxt: - use the unified RX page pool buffers for XDP and non-XDP - use the NAPI skb allocation cache - OcteonTX2: - support Round Robin scheduling HTB offload - TC flower offload support for SPI field - Freescale: - add XDP_TX feature support - AMD: - ionic: add support for PCI FLR event - sfc: - basic conntrack offload - introduce eth, ipv4 and ipv6 pedit offloads - ST Microelectronics: - stmmac: maximze PTP timestamping resolution - Virtual NICs: - Microsoft vNIC: - batch ringing RX queue doorbell on receiving packets - add page pool for RX buffers - Virtio vNIC: - add per queue interrupt coalescing support - Google vNIC: - add queue-page-list mode support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add port range matching tc-flower offload - permit enslavement to netdevices with uppers - Ethernet embedded switches: - Marvell (mv88e6xxx): - convert to phylink_pcs - Renesas: - r8A779fx: add speed change support - rzn1: enables vlan support - Ethernet PHYs: - convert mv88e6xxx to phylink_pcs - WiFi: - Qualcomm Wi-Fi 7 (ath12k): - extremely High Throughput (EHT) PHY support - RealTek (rtl8xxxu): - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU), RTL8192EU and RTL8723BU - RealTek (rtw89): - Introduce Time Averaged SAR (TAS) support - Connector: - support for event filtering" * tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits) net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show net: ethernet: mtk_wed: add some more info in wed_txinfo_show 
handler net: stmmac: clarify difference between "interface" and "phy_interface" r8152: add vendor/device ID pair for D-Link DUB-E250 devlink: move devlink_notify_register/unregister() to dev.c devlink: move small_ops definition into netlink.c devlink: move tracepoint definitions into core.c devlink: push linecard related code into separate file devlink: push rate related code into separate file devlink: push trap related code into separate file devlink: use tracepoint_enabled() helper devlink: push region related code into separate file devlink: push param related code into separate file devlink: push resource related code into separate file devlink: push dpipe related code into separate file devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper devlink: push shared buffer related code into separate file devlink: push port related code into separate file devlink: push object register/unregister notifications into separate helpers inet: fix IP_TRANSPARENT error handling ...
2023-07-14rcuscale: fix building with RCU_TINYArnd Bergmann
Both the CONFIG_TASKS_RCU and CONFIG_TASKS_RUDE_RCU options are broken when RCU_TINY is enabled as well, as some functions are missing a declaration. In file included from kernel/rcu/update.c:649: kernel/rcu/tasks.h:1271:21: error: no previous prototype for 'get_rcu_tasks_rude_gp_kthread' [-Werror=missing-prototypes] 1271 | struct task_struct *get_rcu_tasks_rude_gp_kthread(void) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ kernel/rcu/rcuscale.c:330:27: error: 'get_rcu_tasks_rude_gp_kthread' undeclared here (not in a function); did you mean 'get_rcu_tasks_trace_gp_kthread'? 330 | .rso_gp_kthread = get_rcu_tasks_rude_gp_kthread, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | get_rcu_tasks_trace_gp_kthread In file included from /home/arnd/arm-soc/kernel/rcu/update.c:649: kernel/rcu/tasks.h:1113:21: error: no previous prototype for 'get_rcu_tasks_gp_kthread' [-Werror=missing-prototypes] 1113 | struct task_struct *get_rcu_tasks_gp_kthread(void) | ^~~~~~~~~~~~~~~~~~~~~~~~ Also, building with CONFIG_TASKS_RUDE_RCU but not CONFIG_TASKS_RCU is broken because of some missing stub functions: kernel/rcu/rcuscale.c:322:27: error: 'tasks_scale_read_lock' undeclared here (not in a function); did you mean 'srcu_scale_read_lock'? 322 | .readlock = tasks_scale_read_lock, | ^~~~~~~~~~~~~~~~~~~~~ | srcu_scale_read_lock kernel/rcu/rcuscale.c:323:27: error: 'tasks_scale_read_unlock' undeclared here (not in a function); did you mean 'srcu_scale_read_unlock'? 323 | .readunlock = tasks_scale_read_unlock, | ^~~~~~~~~~~~~~~~~~~~~~~ | srcu_scale_read_unlock Move the declarations outside of the RCU_TINY #ifdef and duplicate the shared stub functions to address all of the above. Fixes: 88d7ff818f0ce ("rcuscale: Add RCU Tasks Rude testing") Fixes: 755f1c5eb416b ("rcuscale: Measure RCU Tasks Trace grace-period kthread CPU time") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14rcuscale: Add RCU Tasks Rude testingPaul E. McKenney
Add a "tasks-rude" option to the rcuscale.scale_type module parameter. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-14rcuscale: Measure RCU Tasks Trace grace-period kthread CPU timePaul E. McKenney
This commit causes RCU Tasks Trace to output the CPU time consumed by its grace-period kthread. The CPU time is whatever is in the designated task's current->stime field, and thus is controlled by whatever CPU-time accounting scheme is in effect. This output appears in microseconds as follows on the console: rcu_scale: Grace-period kthread CPU time: 42367.037 [ paulmck: Apply Willy Tarreau feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-07-12rcu: Export rcu_request_urgent_qs_task()Paul E. McKenney
If a CPU is executing a long series of non-sleeping system calls, RCU grace periods can be delayed for on the order of a couple hundred milliseconds. This is normally not a problem, but if each system call does a call_rcu(), those callbacks can stack up. RCU will eventually notice this callback storm, but use of rcu_request_urgent_qs_task() allows the code invoking call_rcu() to give RCU a heads up. This function is not for general use, not yet, anyway. Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230706033447.54696-11-alexei.starovoitov@gmail.com
2023-05-11rcu-tasks: Stop rcu_tasks_invoke_cbs() from using never-onlined CPUsPaul E. McKenney
The rcu_tasks_invoke_cbs() function relies on queue_work_on() to silently fall back to WORK_CPU_UNBOUND when the specified CPU is offline. However, the queue_work_on() function's silent fallback mechanism relies on that CPU having been online at some time in the past. When queue_work_on() is passed a CPU that has never been online, workqueue lockups ensue, which can be bad for your kernel's general health and well-being. This commit therefore checks whether a given CPU has ever been online, and, if not substitutes WORK_CPU_UNBOUND in the subsequent call to queue_work_on(). Why not simply omit the queue_work_on() call entirely? Because this function is flooding callback-invocation notifications to all CPUs, and must deal with possibilities that include a sparse cpu_possible_mask. This commit also moves the setting of the rcu_data structure's ->beenonline field to rcu_cpu_starting(), which executes on the incoming CPU before that CPU has ever enabled interrupts. This ensures that the required workqueues are present. In addition, because the incoming CPU has not yet enabled its interrupts, there cannot yet have been any softirq handlers running on this CPU, which means that the WARN_ON_ONCE(!rdp->beenonline) within the RCU_SOFTIRQ handler cannot have triggered yet. Fixes: d363f833c6d88 ("rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
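The CPU-selection rule described above can be sketched as a small user-space model. This is only an illustration under stated assumptions: the `model_` names, the fixed CPU count, and the boolean array standing in for the per-CPU ->beenonline field are all hypothetical, and no real workqueue is involved.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS     8
#define MODEL_CPU_UNBOUND MODEL_NR_CPUS  /* hypothetical stand-in for WORK_CPU_UNBOUND */

/* Hypothetical stand-in for the rcu_data structure's ->beenonline field. */
static bool model_beenonline[MODEL_NR_CPUS];

/* Models the early bring-up hook: record that this CPU has come online. */
static void model_cpu_starting(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_beenonline[cpu] = true;
}

/*
 * Pick the CPU to queue work on: the requested CPU if it has ever been
 * online, otherwise the "unbound" placeholder so the work still runs.
 */
static int model_pick_work_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return model_beenonline[cpu] ? cpu : MODEL_CPU_UNBOUND;
}
```

The work item is still queued in the never-onlined case, just without a CPU binding, which matches the commit's point that the notification must not be dropped.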
2023-04-05Merge branches 'rcu/staging-core', 'rcu/staging-docs' and ↵Joel Fernandes (Google)
'rcu/staging-kfree', remote-tracking branches 'paul/srcu-cf.2023.04.04a', 'fbq/rcu/lockdep.2023.03.27a' and 'fbq/rcu/rcutorture.2023.03.20a' into rcu/staging
2023-04-05rcu: Further comment and explain the state space of GP sequencesFrederic Weisbecker
The state space of the GP sequence number isn't documented and the definitions of its special values are scattered. This commit therefore gathers some common knowledge near the grace-period sequence-number definitions. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
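The state space this commit documents can be illustrated with a user-space model. It is a minimal sketch, assuming the low two bits of the sequence number hold the grace-period state and the remaining bits hold the counter; the `model_` names are hypothetical, and the memory barriers and READ_ONCE()/WRITE_ONCE() accesses of the kernel helpers are deliberately omitted.

```c
#include <assert.h>

/* Assumed layout: low 2 bits = grace-period state, upper bits = counter. */
#define MODEL_SEQ_CTR_SHIFT  2
#define MODEL_SEQ_STATE_MASK ((1UL << MODEL_SEQ_CTR_SHIFT) - 1)

/* Counter portion: advances once per completed grace period. */
static unsigned long model_seq_ctr(unsigned long s)
{
	return s >> MODEL_SEQ_CTR_SHIFT;
}

/* State portion: zero when idle, nonzero while a grace period runs. */
static unsigned long model_seq_state(unsigned long s)
{
	return s & MODEL_SEQ_STATE_MASK;
}

/* Start a grace period: the state moves from 0 to 1. */
static void model_seq_start(unsigned long *sp)
{
	*sp += 1;
	assert(model_seq_state(*sp) == 1);
}

/* End a grace period: clear the state bits and bump the counter. */
static void model_seq_end(unsigned long *sp)
{
	assert(model_seq_state(*sp) != 0);
	*sp = (*sp | MODEL_SEQ_STATE_MASK) + 1;
}

/*
 * Snapshot: the value the sequence must reach before a full grace period
 * is guaranteed to have elapsed after this call.
 */
static unsigned long model_seq_snap(unsigned long s)
{
	return (s + 2 * MODEL_SEQ_STATE_MASK + 1) & ~MODEL_SEQ_STATE_MASK;
}
```

Starting from an idle value of 0, a snapshot yields 4. Once a grace period has started (value 1), a snapshot yields 8, because the in-flight grace period may already have missed the caller's update and so cannot be counted.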
2023-04-04srcu: Begin offloading srcu_struct fields to srcu_updatePaul E. McKenney
The current srcu_struct structure is on the order of 200 bytes in size (depending on architecture and .config), which is much better than the old-style 26K bytes, but still all too inconvenient when one is trying to achieve good cache locality on a fastpath involving SRCU readers. However, only a few fields in srcu_struct are used by SRCU readers. The remaining fields could be offloaded to a new srcu_update structure, thus shrinking the srcu_struct structure down to a few tens of bytes. This commit begins this noble quest, a quest that is complicated by open-coded initialization of the srcu_struct within the srcu_notifier_head structure. This complication is addressed by updating the srcu_notifier_head structure's open coding, given that there does not appear to be a straightforward way of abstracting that initialization. This commit moves only the ->node pointer to srcu_update. Later commits will move additional fields. [ paulmck: Fold in qiang1.zhang@intel.com's memory-leak fix. ] Link: https://lore.kernel.org/all/20230320055751.4120251-1-qiang1.zhang@intel.com/ Suggested-by: Christoph Hellwig <hch@lst.de> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-02-02Merge branch 'stall.2023.01.09a' into HEADPaul E. McKenney
stall.2023.01.09a: RCU CPU stall-warning updates.
2023-01-17rcu: Track laziness during boot and suspendJoel Fernandes (Google)
Boot and suspend/resume should not be slowed down in kernels built with CONFIG_RCU_LAZY=y. In particular, suspend can sometimes fail in such kernels. This commit therefore adds rcu_async_hurry(), rcu_async_relax(), and rcu_async_should_hurry() functions that track whether or not either a boot or a suspend/resume operation is in progress. This will enable a later commit to refrain from laziness during those times. Export rcu_async_should_hurry(), rcu_async_hurry(), and rcu_async_relax() for later use by rcutorture. [ paulmck: Apply feedback from Steve Rostedt. ] Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
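One way to track "boot or suspend/resume in progress" is a nesting counter, sketched below in user-space C11. This is an assumption made for illustration, not a copy of the kernel code: the `model_` names are hypothetical, and the initial value of 1 models the idea that boot counts as an in-progress operation until it is explicitly relaxed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical nesting counter; starts at 1 because boot is in progress. */
static atomic_int model_hurry_nesting = 1;

/* Entering a phase (for example suspend) during which laziness is unwanted. */
static void model_async_hurry(void)
{
	atomic_fetch_add(&model_hurry_nesting, 1);
}

/* Leaving such a phase (end of boot, end of resume). */
static void model_async_relax(void)
{
	int old = atomic_fetch_sub(&model_hurry_nesting, 1);

	assert(old > 0); /* an unbalanced relax would underflow the counter */
	(void)old;
}

/* True while at least one hurry phase is active. */
static bool model_async_should_hurry(void)
{
	return atomic_load(&model_hurry_nesting) > 0;
}
```

A counter rather than a flag lets phases overlap or nest without one phase's exit cancelling another's request.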
2023-01-05rcu: Add RCU stall diagnosis informationZhen Lei
Because RCU CPU stall warnings are driven from the scheduling-clock interrupt handler, a workload consisting of a very large number of short-duration hardware interrupts can result in misleading stall-warning messages. On systems supporting only a single level of interrupts, that is, where interrupt handlers cannot be interrupted, this can produce misleading diagnostics. The stack traces will show the innocent-bystander interrupted task, not the interrupts that are at the very least exacerbating the stall. This situation can be improved by displaying the number of interrupts and the CPU time that they have consumed. Diagnosing other types of stalls can be eased by also providing the count of softirqs and the CPU time that they consumed as well as the number of context switches and the task-level CPU time consumed. Consider the following output given this change:
  rcu: INFO: rcu_preempt self-detected stall on CPU
  rcu:     0-....: (1250 ticks this GP) <omitted>
  rcu:          hardirqs   softirqs   csw/system
  rcu:  number:      624         45            0
  rcu: cputime:       69          1         2425   ==> 2500(ms)
This output shows that the number of hard and soft interrupts is small, there are no context switches, and system time accounts for nearly all of the 2500 ms interval. This indicates that the current task is looping with preemption disabled. The impact on system performance is negligible because the snapshot is recorded only once for all continuous RCU stalls. This added debugging information is suppressed by default and can be enabled by building the kernel with CONFIG_RCU_CPU_STALL_CPUTIME=y or by booting with rcupdate.rcu_cpu_stall_cputime=1. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-03rcu: Allow expedited RCU CPU stall warnings to dump task stacksPaul E. McKenney
This commit introduces the rcupdate.rcu_exp_stall_task_details kernel boot parameter, which causes expedited RCU CPU stall warnings to dump the stacks of any tasks blocking the current expedited grace period. Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-12-12Merge tag 'printk-for-6.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Add NMI-safe SRCU reader API. It uses atomic_inc() instead of this_cpu_inc() on strong load-store architectures. - Introduce new console_list_lock to synchronize manipulation of the list of registered consoles and their flags. This is a first step in removing the big-kernel-lock-like behavior of console_lock(). This semaphore still serializes console->write() callbacks against: - each other. It primarily prevents potential races between early and proper console drivers using the same device. - suspend()/resume() callbacks and init() operations in some drivers. - various other operations in the tty/vt and framebuffer subsystems. It is likely that console_lock() serializes even operations that are not directly conflicting with the console->write() callbacks here. This is the most complicated big-kernel-lock aspect of the console_lock() that will be hard to untangle. - Introduce new console_srcu lock that is used to safely iterate and access the registered console drivers under SRCU read lock. This is a prerequisite for introducing atomic console drivers and console kthreads. It will reduce the complexity of serialization against normal consoles and console_lock(). Also it should remove the risk of deadlock during critical situations, like Oops or panic, when only atomic consoles are registered. - Check whether the console is registered instead of enabled in many locations. It was a historical leftover. - Cleanly force a preferred console in xenfb code instead of a dirty hack. - A lot of code and comment clean ups and improvements. 
* tag 'printk-for-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: (47 commits) printk: htmldocs: add missing description tty: serial: sh-sci: use setup() callback for early console printk: relieve console_lock of list synchronization duties tty: serial: kgdboc: use console_list_lock to trap exit tty: serial: kgdboc: synchronize tty_find_polling_driver() and register_console() tty: serial: kgdboc: use console_list_lock for list traversal tty: serial: kgdboc: use srcu console list iterator proc: consoles: use console_list_lock for list iteration tty: tty_io: use console_list_lock for list synchronization printk, xen: fbfront: create/use safe function for forcing preferred netconsole: avoid CON_ENABLED misuse to track registration usb: early: xhci-dbc: use console_is_registered() tty: serial: xilinx_uartps: use console_is_registered() tty: serial: samsung_tty: use console_is_registered() tty: serial: pic32_uart: use console_is_registered() tty: serial: earlycon: use console_is_registered() tty: hvc: use console_is_registered() efi: earlycon: use console_is_registered() tty: nfcon: use console_is_registered() serial_core: replace uart_console_enabled() with uart_console_registered() ...
2022-11-29rcu: Make SRCU mandatoryPaul E. McKenney
Kernels configured with CONFIG_PRINTK=n and CONFIG_SRCU=n get build failures. This causes trouble for deep embedded systems. But given that there are more than 25 instances of "select SRCU" in the kernel, it is hard to believe that there are many kernels running in production without SRCU. This commit therefore makes SRCU mandatory. The SRCU Kconfig option remains for backwards compatibility, and will be removed when it is no longer used. [ paulmck: Update per kernel test robot feedback. ] Reported-by: John Ogness <john.ogness@linutronix.de> Reported-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: <linux-arch@vger.kernel.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Reviewed-by: John Ogness <john.ogness@linutronix.de>
2022-11-29rcu: Make call_rcu() lazy to save powerJoel Fernandes (Google)
Implement timer-based RCU callback batching (also known as lazy callbacks). With this we save about 5-10% of power consumed due to RCU requests that happen when system is lightly loaded or idle. By default, all async callbacks (queued via call_rcu) are marked lazy. An alternate API call_rcu_hurry() is provided for the few users, for example synchronize_rcu(), that need the old behavior. The batch is flushed whenever a certain amount of time has passed, or the batch on a particular CPU grows too big. Also memory pressure will flush it in a future patch. To handle several corner cases automagically (such as rcu_barrier() and hotplug), we re-use bypass lists which were originally introduced to address lock contention, to handle lazy CBs as well. The bypass list length has the lazy CB length included in it. A separate lazy CB length counter is also introduced to keep track of the number of lazy CBs. [ paulmck: Fix formatting of inline call_rcu_lazy() definition. ] [ paulmck: Apply Zqiang feedback. ] [ paulmck: Apply s/call_rcu_flush/call_rcu_hurry/ feedback from Tejun Heo. ] Suggested-by: Paul McKenney <paulmck@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
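The flush conditions named above (elapsed time, batch size, or an explicit hurry request) can be sketched as a user-space model. Everything here is hypothetical and for illustration only: the structure, the `model_` names, and the abstract time units are assumptions, and the kernel's real implementation sits on the bypass lists and a timer rather than on a helper like this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU batch of lazy callbacks. */
struct model_lazy_batch {
	size_t len;             /* callbacks currently queued */
	size_t max_len;         /* flush once the batch grows this large */
	unsigned long first_ts; /* arrival time of the oldest queued callback */
	unsigned long max_age;  /* flush once the oldest callback is this old */
};

/* What a timer tick would check: is the batch too big or too old? */
static bool model_lazy_should_flush(const struct model_lazy_batch *b,
				    unsigned long now)
{
	return b->len > 0 &&
	       (b->len >= b->max_len || now - b->first_ts >= b->max_age);
}

/*
 * Queue one callback at time "now". A hurry request (modelling a
 * call_rcu_hurry()-style caller) forces an immediate flush.
 * Returns true if the caller should flush the batch.
 */
static bool model_lazy_enqueue(struct model_lazy_batch *b,
			       unsigned long now, bool hurry)
{
	if (b->len == 0)
		b->first_ts = now;
	b->len++;
	return hurry || model_lazy_should_flush(b, now);
}

/* Hand the whole batch over for processing; returns how many were queued. */
static size_t model_lazy_flush(struct model_lazy_batch *b)
{
	size_t n = b->len;

	b->len = 0;
	return n;
}
```

Counting lazy callbacks separately from the overall list length, as the commit does, is what lets a check like this apply only to the lazy portion.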
2022-07-21Merge branch 'ctxt.2022.07.05a' into HEADPaul E. McKenney
ctxt.2022.07.05a: Linux-kernel memory model development branch.
2022-07-05rcu/context_tracking: Move dynticks_nmi_nesting to context trackingFrederic Weisbecker
The RCU eqs tracking is going to be performed by the context tracking subsystem. The related nesting counters thus need to be moved to the context tracking structure. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicolas Saenz Julienne <nsaenz@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com> Cc: Yu Liao <liaoyu15@huawei.com> Cc: Phil Auld <pauld@redhat.com> Cc: Paul Gortmaker<paul.gortmaker@windriver.com> Cc: Alex Belits <abelits@marvell.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
2022-06-20rcu: Provide a get_completed_synchronize_rcu() functionPaul E. McKenney
It is currently up to the caller to handle stale return values from get_state_synchronize_rcu(). If poll_state_synchronize_rcu() returned true once, a grace period has elapsed, regardless of the fact that counter wrap might cause some future poll_state_synchronize_rcu() invocation to return false. For example, the caller might store a separate flag that indicates whether some previous call to poll_state_synchronize_rcu() determined that the relevant grace period had already ended. This approach works, but it requires extra storage and is easy to get wrong. This commit therefore introduces a get_completed_synchronize_rcu() that returns a cookie that causes poll_state_synchronize_rcu() to always return true. This already-completed cookie can be stored in place of the cookie that previously caused poll_state_synchronize_rcu() to return true. It can also be used to flag a given structure as not having been exposed to readers, and thus not requiring a grace period to elapse. This commit is in preparation for polled expedited grace periods. Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
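The "store the already-completed cookie in place of the old one" pattern can be sketched in user space as follows. This is a minimal model under stated assumptions: the `model_` names and the object structure are hypothetical, the reserved value 0x1 is assumed to be safe because real snapshots have their low state bits clear, and the poll uses the simple half-range comparison for brevity.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical reserved cookie meaning "grace period already elapsed". */
#define MODEL_STATE_COMPLETED 0x1UL

/* Hypothetical object that must wait for a grace period before reuse. */
struct model_obj {
	unsigned long gp_cookie;
};

/* Model of polling: true if the grace period named by cookie has ended. */
static bool model_poll(unsigned long cur_seq, unsigned long cookie)
{
	if (cookie == MODEL_STATE_COMPLETED)
		return true;
	return ULONG_MAX / 2 >= cur_seq - cookie; /* wrap-tolerant cur >= cookie */
}

/*
 * True if obj may be reused. Once the poll succeeds, latch the completed
 * cookie so that later counter wrap cannot turn the answer back to false.
 */
static bool model_obj_ready(struct model_obj *obj, unsigned long cur_seq)
{
	if (!model_poll(cur_seq, obj->gp_cookie))
		return false;
	obj->gp_cookie = MODEL_STATE_COMPLETED;
	return true;
}
```

Latching the result in the cookie itself removes the need for the separate "already completed" flag that the commit message describes as easy to get wrong.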
2022-06-20rcu: Make normal polling GP be more precise about sequence numbersPaul E. McKenney
Currently, poll_state_synchronize_rcu() uses rcu_seq_done() to check whether the specified grace period has completed. However, rcu_seq_done() does a simple comparison that reserves half of the sequence-number space for uncompleted grace periods. This has the unfortunate side-effect of not handling sequence-number wrap gracefully. Of course, one can argue that if someone has already waited for half of the full range of grace periods, they can wait for the other half, but why wait at all in this case? This commit therefore creates a rcu_seq_done_exact() that counts as uncompleted only the two grace periods during which the sequence number might have been handed out, while still being uncompleted. This way, if sequence-number wrap happens to hit that range, at most two additional grace periods need be waited for. This commit is in preparation for polled expedited grace periods. Link: https://lore.kernel.org/all/20220121142454.1994916-1-bfoster@redhat.com/ Link: https://docs.google.com/document/d/1RNKWW9jQyfjxw2E8dsXVTdvZYh0HnYeSHDKog9jhdN8/edit?usp=sharing Cc: Brian Foster <bfoster@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
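The difference between the two checks can be seen in a user-space model. This is a sketch under stated assumptions: the `model_` names are hypothetical, the functions take plain values instead of pointers, the low two bits are assumed to hold the grace-period state, and the guard band of two grace periods reflects this commit rather than any later adjustment.

```c
#include <limits.h>
#include <stdbool.h>

#define MODEL_SEQ_STATE_MASK 3UL /* assumed: low two bits hold the GP state */

/* Wrap-tolerant "a >= b" on unsigned long. */
static bool model_cmp_ge(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 >= a - b;
}

/* Wrap-tolerant "a < b" on unsigned long. */
static bool model_cmp_lt(unsigned long a, unsigned long b)
{
	return ULONG_MAX / 2 < a - b;
}

/* Simple check: half of the number space counts as "not yet done". */
static bool model_seq_done(unsigned long cur, unsigned long snap)
{
	return model_cmp_ge(cur, snap);
}

/*
 * Exact check: only the window of two grace periods just below snap
 * counts as "not yet done"; anything further behind is treated as a
 * stale cookie whose grace period ended long ago.
 */
static bool model_seq_done_exact(unsigned long cur, unsigned long snap)
{
	return model_cmp_ge(cur, snap) ||
	       model_cmp_lt(cur, snap - (2 * MODEL_SEQ_STATE_MASK + 1));
}
```

With a snapshot of 8 and a current value of 4, both checks report "not done". If the counter has since advanced by more than half of its range, the simple check still reports "not done" while the exact check reports completion, which is the wrap case the commit is addressing.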
2022-05-26Merge tag 'sysctl-5.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "For two kernel releases now kernel/sysctl.c has been being cleaned up slowly, since the tables were grossly long, sprinkled with tons of #ifdefs and all this caused merge conflicts with one subsystem or another. This tree was put together to help try to avoid conflicts with these cleanups going on different trees at time. So nothing exciting on this pull request, just cleanups. Thanks a lot to the Uniontech and Huawei folks for doing some of this nasty work" * tag 'sysctl-5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (28 commits) sched: Fix build warning without CONFIG_SYSCTL reboot: Fix build warning without CONFIG_SYSCTL kernel/kexec_core: move kexec_core sysctls into its own file sysctl: minor cleanup in new_dir() ftrace: fix building with SYSCTL=y but DYNAMIC_FTRACE=n fs/proc: Introduce list_for_each_table_entry for proc sysctl mm: fix unused variable kernel warning when SYSCTL=n latencytop: move sysctl to its own file ftrace: fix building with SYSCTL=n but DYNAMIC_FTRACE=y ftrace: Fix build warning ftrace: move sysctl_ftrace_enabled to ftrace.c kernel/do_mount_initrd: move real_root_dev sysctls to its own file kernel/delayacct: move delayacct sysctls to its own file kernel/acct: move acct sysctls to its own file kernel/panic: move panic sysctls to its own file kernel/lockdep: move lockdep sysctls to its own file mm: move page-writeback sysctls to their own file mm: move oom_kill sysctls to their own file kernel/reboot: move reboot sysctls to its own file sched: Move energy_aware sysctls to topology.c ...
2022-05-11Merge branch 'exp.2022.05.11a' into HEADPaul E. McKenney
exp.2022.05.11a: Expedited-grace-period latency-reduction updates.
2022-05-11rcu: Move expedited grace period (GP) work to RT kthread_workerKalesh Singh
Enabling CONFIG_RCU_BOOST did not reduce RCU expedited grace-period latency because its workqueues run at SCHED_OTHER, and thus can be delayed by normal processes. This commit avoids these delays by moving the expedited GP work items to a real-time-priority kthread_worker. This option is controlled by CONFIG_RCU_EXP_KTHREAD and disabled by default on PREEMPT_RT=y kernels which disable expedited grace periods after boot by unconditionally setting rcupdate.rcu_normal_after_boot=1. The results were evaluated on arm64 Android devices (6GB ram) running 5.10 kernel, and capturing trace data in critical user-level code. The table below shows the resulting order-of-magnitude improvements in synchronize_rcu_expedited() latency:
------------------------------------------------------------------------
|                          |   workqueues | kthread_worker |      Diff |
------------------------------------------------------------------------
| Count                    |          725 |            688 |           |
------------------------------------------------------------------------
| Min Duration (ns)        |          326 |            447 |    37.12% |
------------------------------------------------------------------------
| Q1 (ns)                  |       39,428 |         38,971 |    -1.16% |
------------------------------------------------------------------------
| Q2 - Median (ns)         |       98,225 |         69,743 |   -29.00% |
------------------------------------------------------------------------
| Q3 (ns)                  |      342,122 |        126,638 |   -62.98% |
------------------------------------------------------------------------
| Max Duration (ns)        |  372,766,967 |      2,329,671 |   -99.38% |
------------------------------------------------------------------------
| Avg Duration (ns)        |    2,746,353 |        151,242 |   -94.49% |
------------------------------------------------------------------------
| Standard Deviation (ns)  |   19,327,765 |        294,408 |           |
------------------------------------------------------------------------
The table below shows the range of maximums/minimums for synchronize_rcu_expedited() latency from all experiments:
------------------------------------------------------------------------
|                          |   workqueues | kthread_worker |      Diff |
------------------------------------------------------------------------
| Total No. of Experiments |           25 |             23 |           |
------------------------------------------------------------------------
| Largest Maximum (ns)     |  372,766,967 |      2,329,671 |   -99.38% |
------------------------------------------------------------------------
| Smallest Maximum (ns)    |       38,819 |         86,954 |   124.00% |
------------------------------------------------------------------------
| Range of Maximums (ns)   |  372,728,148 |      2,242,717 |           |
------------------------------------------------------------------------
| Largest Minimum (ns)     |       88,623 |         27,588 |   -68.87% |
------------------------------------------------------------------------
| Smallest Minimum (ns)    |          326 |            447 |    37.12% |
------------------------------------------------------------------------
| Range of Minimums (ns)   |       88,297 |         27,141 |           |
------------------------------------------------------------------------
Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Tejun Heo <tj@kernel.org> Reported-by: Tim Murray <timmurray@google.com> Reported-by: Wei Wang <wvw@google.com> Tested-by: Kyle Lin <kylelin@google.com> Tested-by: Chunwei Lu <chunweilu@google.com> Tested-by: Lulu Wang <luluw@google.com> Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-05-11rcu: Introduce CONFIG_RCU_EXP_CPU_STALL_TIMEOUTUladzislau Rezki
Currently both expedited and regular grace period stall warnings use a single timeout value with units of seconds. However, recent Android use cases require a sub-100-millisecond expedited RCU CPU stall warning. Given that expedited RCU grace periods normally complete in far less than a single millisecond, especially for small systems, this is not unreasonable. Therefore introduce the CONFIG_RCU_EXP_CPU_STALL_TIMEOUT kernel configuration that defaults to 20 msec on Android and remains the same as that of the non-expedited stall warnings otherwise. It can also be changed at run time via: /sys/.../parameters/rcu_exp_cpu_stall_timeout. [ paulmck: Default of zero to use CONFIG_RCU_STALL_TIMEOUT. ] Signed-off-by: Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
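The "zero means fall back to the regular timeout" rule can be sketched as a small helper. This is an illustration under stated assumptions: the function name is hypothetical, the expedited value is assumed to be in milliseconds and the regular value in seconds, and any clamping that a real implementation might apply is left out.

```c
/*
 * Hypothetical helper: choose the expedited stall timeout in milliseconds.
 * exp_ms   - expedited setting in milliseconds, 0 meaning "not set"
 * normal_s - regular stall timeout in seconds
 */
static unsigned int model_exp_stall_timeout_ms(unsigned int exp_ms,
					       unsigned int normal_s)
{
	/* Zero selects the regular timeout, converted to milliseconds. */
	return exp_ms ? exp_ms : normal_s * 1000;
}
```

For example, an expedited setting of 0 with a 21-second regular timeout gives 21000 ms, while an expedited setting of 20 gives 20 ms regardless of the regular timeout.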