2024-02-26Merge branches 'rcu-doc.2024.02.14a', 'rcu-nocb.2024.02.14a', 'rcu-exp.2024.02.14a', 'rcu-tasks.2024.02.26a' and 'rcu-misc.2024.02.14a' into rcu.2024.02.26aBoqun Feng
2024-02-25rcu-tasks: Maintain real-time response in rcu_tasks_postscan()Paul E. McKenney
The current code will scan the entirety of each per-CPU list of exiting tasks in ->rtp_exit_list with interrupts disabled. This is normally just fine, because each CPU typically won't have very many tasks in this state. However, if a large number of tasks block late in do_exit(), these lists could be arbitrarily long. Low probability, perhaps, but it really could happen.

This commit therefore occasionally re-enables interrupts while traversing these lists, inserting a dummy element to hold the current place in the list. In kernels built with CONFIG_PREEMPT_RT=y, this re-enabling happens after each list element is processed, otherwise every one-to-two jiffies.

[ paulmck: Apply Frederic Weisbecker feedback. ]

Link: https://lore.kernel.org/all/ZdeI_-RfdLR8jlsm@localhost.localdomain/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
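[ Editor's illustration of the technique, not the actual patch; all names are hypothetical and it assumes a single scanner of the list: ]

    #include <linux/jiffies.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    struct exiting_task {
            struct list_head entry;
            /* ... per-task state ... */
    };

    static void process(struct exiting_task *et) { /* per-task work */ }

    /* Scan @head under @lock, but bound the IRQ-disabled time. */
    static void scan_with_breaks(struct list_head *head, raw_spinlock_t *lock)
    {
            LIST_HEAD(dummy);       /* placeholder that holds our position */
            struct list_head *pos;
            unsigned long flags, deadline;

            raw_spin_lock_irqsave(lock, flags);
            deadline = jiffies + 1;                 /* one-to-two jiffies */
            for (pos = head->next; pos != head; pos = pos->next) {
                    process(list_entry(pos, struct exiting_task, entry));
                    if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
                        time_before(jiffies, deadline))
                            continue;
                    /* Park the dummy right after the current element... */
                    list_add(&dummy, pos);
                    raw_spin_unlock_irqrestore(lock, flags);
                    /* ...interrupts (and on RT, preemption) happen here,
                     * and concurrent removals cannot strand us because
                     * the dummy stays linked... */
                    raw_spin_lock_irqsave(lock, flags);
                    /* ...then resume right after wherever the dummy sits. */
                    pos = dummy.prev;
                    list_del(&dummy);
                    deadline = jiffies + 1;
            }
            raw_spin_unlock_irqrestore(lock, flags);
    }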
2024-02-25rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasksPaul E. McKenney
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch.

However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce.

In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact.

This commit therefore eliminates these deadlocks by replacing the SRCU-based wait for do_exit() completion with per-CPU lists of tasks currently exiting. A given task will be on one of these per-CPU lists for the same period of time that it would previously have spent in the SRCU read-side critical section. These lists enable RCU Tasks to find the tasks that have already been removed from the tasks list, but that must nevertheless be waited upon. The RCU Tasks grace period gathers any of these do_exit() tasks that it must wait on, and adds them to the list of holdouts. Per-CPU locking and get_task_struct() are used to synchronize addition to and removal from these lists.

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney
This commit continues the elimination of deadlocks involving do_exit() and RCU tasks by causing exit_tasks_rcu_start() to add the current task to a per-CPU list and causing exit_tasks_rcu_stop() to remove the current task from whatever list it is on. These lists will be used to track tasks that are exiting, while still accounting for any RCU-tasks quiescent states that these tasks pass through.

[ paulmck: Apply Frederic Weisbecker feedback. ]

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
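[ Editor's sketch of this start/stop pairing, simplified; the per-CPU structure, the task_struct fields, and the helper names are illustrative stand-ins, not the actual patch: ]

    /* One list of exiting tasks per CPU, protected by a per-CPU lock. */
    struct exit_rcu_cpu {
            raw_spinlock_t lock;
            struct list_head exiting;
    };
    static DEFINE_PER_CPU(struct exit_rcu_cpu, exit_rcu);

    /* Called early in do_exit(): make this task findable by RCU Tasks.
     * Assumes task_struct gained ->exit_rcu_entry and ->exit_rcu_cpu. */
    void sketch_exit_tasks_rcu_start(void)
    {
            struct exit_rcu_cpu *erc;
            unsigned long flags;

            preempt_disable();
            erc = this_cpu_ptr(&exit_rcu);
            current->exit_rcu_cpu = smp_processor_id();  /* remember the list */
            raw_spin_lock_irqsave(&erc->lock, flags);
            list_add(&current->exit_rcu_entry, &erc->exiting);
            raw_spin_unlock_irqrestore(&erc->lock, flags);
            preempt_enable();
    }

    /* Called late in do_exit(): no further waiting on this task is needed. */
    void sketch_exit_tasks_rcu_stop(void)
    {
            struct exit_rcu_cpu *erc = per_cpu_ptr(&exit_rcu,
                                                   current->exit_rcu_cpu);
            unsigned long flags;

            raw_spin_lock_irqsave(&erc->lock, flags);
            list_del_init(&current->exit_rcu_entry);
            raw_spin_unlock_irqrestore(&erc->lock, flags);
    }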
2024-02-25rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch.

However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce.

In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact.

This commit therefore initializes the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks.

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25rcu-tasks: Initialize callback lists at rcu_init() timePaul E. McKenney
In order for RCU Tasks to reliably maintain per-CPU lists of exiting tasks, those lists must be initialized before it is possible for tasks to exit, especially given that the boot CPU is not necessarily CPU 0 (an example being powerpc kexec() kernels). And at the time that rcu_init_tasks_generic() is called, a task could potentially exit, unconventional though that sort of thing might be.

This commit therefore moves the calls to cblist_init_generic() from functions called from rcu_init_tasks_generic() to a new function named tasks_cblist_init_generic() that is invoked from rcu_init().

This constituted a bug in a commit that never went to mainline, so there is no need for any backporting to -stable.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-25rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocksPaul E. McKenney
Holding a mutex across synchronize_rcu_tasks() and acquiring that same mutex in code called from do_exit() after its call to exit_tasks_rcu_start() but before its call to exit_tasks_rcu_stop() results in deadlock. This is by design, because tasks that are far enough into do_exit() are no longer present on the tasks list, making it a bit difficult for RCU Tasks to find them, let alone wait on them to do a voluntary context switch.

However, such deadlocks are becoming more frequent. In addition, lockdep currently does not detect such deadlocks and they can be difficult to reproduce.

In addition, if a task voluntarily context switches during that time (for example, if it blocks acquiring a mutex), then this task is in an RCU Tasks quiescent state. And with some adjustments, RCU Tasks could just as well take advantage of that fact.

This commit therefore adds the data structures that will be needed to rely on these quiescent states and to eliminate these deadlocks.

Link: https://lore.kernel.org/all/20240118021842.290665-1-chenzhongjin@huawei.com/
Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reported-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Yang Jihong <yangjihong1@huawei.com>
Tested-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu-tasks: Repair RCU Tasks Trace quiescence checkPaul E. McKenney
The context-switch-time check for RCU Tasks Trace quiescence expects current->trc_reader_special.b.need_qs to be zero, and if so, updates it to TRC_NEED_QS_CHECKED. This is backwards, because if this value is zero, there is no RCU Tasks Trace grace period in flight, and thus no need for a quiescent state. Instead, when a grace period starts, this field is set to TRC_NEED_QS. This commit therefore changes the check from zero to TRC_NEED_QS.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
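[ Editor's sketch of the shape of the fix, reconstructed from the description above; the real code updates the field with a cmpxchg-based helper and has more context: ]

    /* Before: wrongly treated "no grace period in flight" as needing a QS. */
    if (!READ_ONCE(t->trc_reader_special.b.need_qs))
            /* ... record a quiescent state ... */

    /* After: record a QS only when one has actually been requested. */
    if (READ_ONCE(t->trc_reader_special.b.need_qs) == TRC_NEED_QS)
            WRITE_ONCE(t->trc_reader_special.b.need_qs, TRC_NEED_QS_CHECKED);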
2024-02-14rcu/sync: remove unused rcu_sync_enter_start functionOnkarnath
With commit 6a010a49b63a ("cgroup: Make !percpu threadgroup_rwsem operations optional"), the last usage of rcu_sync_enter_start() was removed, so this function can also be removed.

In the words of Oleg Nesterov:

  __rcu_sync_enter(wait => false) is a better alternative if someone needs rcu_sync_enter_start() again.

Link: https://lore.kernel.org/all/20220725121208.GB28662@redhat.com/
Signed-off-by: Onkarnath <onkarnath.1@samsung.com>
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcutorture: Suppress rtort_pipe_count warnings until after stallsPaul E. McKenney
Currently, if rcu_torture_writer() sees fewer than ten grace periods having elapsed during a call to stutter_wait() that actually waited, the rtort_pipe_count warning is emitted. This has worked well for a long time. Except that the rcutorture TREE07 scenario now does a short-term 14-second RCU CPU stall, which can most definitely cause false-positive rtort_pipe_count warnings.

This commit therefore changes rcu_torture_writer() to compute the full expected holdoff and stall duration, and to refuse to report any rtort_pipe_count warnings until after all stalls have completed.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14srcu: Improve comments about acceleration leakJoel Fernandes (Google)
The comments added in commit 1ef990c4b36b ("srcu: No need to advance/accelerate if no callback enqueued") are a bit confusing. The comments are describing a scenario for code that was moved and is no longer the way it was (snapshot after advancing). Improve the code comments to reflect this and also document why acceleration can never fail.

Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu: Provide a boot time parameter to control lazy RCUQais Yousef
To allow more flexible arrangements while still providing a single kernel for distros, provide a boot time parameter to enable/disable lazy RCU.

Specify:

    rcutree.enable_rcu_lazy=[y|1|n|0]

This also requires rcu_nocbs=all at boot time for lazy RCU to take effect. To disable lazy RCU by default at build time when CONFIG_RCU_LAZY=y, the new CONFIG_RCU_LAZY_DEFAULT_OFF can be used.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
Tested-by: Andrea Righi <andrea.righi@canonical.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
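[ Editor's sketch of one plausible way to wire up such a parameter; the parameter and Kconfig names come from the commit text, the rest is assumed: ]

    /* Default on, unless CONFIG_RCU_LAZY_DEFAULT_OFF=y flips the default;
     * overridable at boot via rcutree.enable_rcu_lazy=[y|1|n|0]. */
    static bool enable_rcu_lazy __read_mostly =
            !IS_ENABLED(CONFIG_RCU_LAZY_DEFAULT_OFF);
    module_param(enable_rcu_lazy, bool, 0444);

A kernel would then be booted with, for example, "rcutree.enable_rcu_lazy=0 rcu_nocbs=all" to run with lazy RCU disabled while keeping the offloaded-callback configuration that lazy RCU depends on.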
2024-02-14rcu: Rename jiffies_till_flush to jiffies_lazy_flushFrederic Weisbecker
The variable name jiffies_till_flush is too generic and therefore:

* It may shadow a global variable

* It doesn't say what it operates on

Make the name more precise, along with the related APIs.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14doc: Update checklist.rst discussion of callback executionPaul E. McKenney
This commit completes the list of call_rcu*() functions that are not guaranteed to have their callbacks executing on the same CPU. While in the area, fix an unrelated typo.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14doc: Clarify use of slab constructors and SLAB_TYPESAFE_BY_RCUPaul E. McKenney
This commit explicitly states that you should initialize any locks to be used by readers in your SLAB_TYPESAFE_BY_RCU constructor.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
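[ Editor's illustration of the pattern being documented; the structure and names are hypothetical: ]

    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct foo {
            spinlock_t lock;        /* taken by readers under rcu_read_lock() */
            int data;
    };

    /* The constructor runs when a slab page is allocated, NOT on every
     * kmem_cache_alloc(), so a reader racing with a free-then-realloc of
     * the same object still sees an initialized lock. */
    static void foo_ctor(void *addr)
    {
            struct foo *f = addr;

            spin_lock_init(&f->lock);
    }

    static struct kmem_cache *foo_cache;

    static int __init foo_cache_init(void)
    {
            foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
                                          SLAB_TYPESAFE_BY_RCU, foo_ctor);
            return foo_cache ? 0 : -ENOMEM;
    }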
2024-02-14context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}()Paul E. McKenney
Document the "state" parameter of both of these functions. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312041922.YZCcEPYD-lkp@intel.com/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14doc: Add EARLY flag to early-parsed kernel boot parametersPaul E. McKenney
Kernel boot parameters declared with early_param() are parsed before embedded parameters are extracted from initrd, and early_param() parameters are not helpful when embedded in initrd. Therefore, mark early_param() kernel boot parameters with "EARLY" in kernel-parameters.txt.

The following early_param() calls declare kernel boot parameters that are undocumented:

    early_param("atmel.pm_modes", at91_pm_modes_select);
    early_param("mem_fclk_21285", early_fclk);
    early_param("ecc", early_ecc);
    early_param("cachepolicy", early_cachepolicy);
    early_param("nodebugmon", early_debug_disable);
    early_param("kfence.sample_interval", parse_kfence_early_init);
    early_param("additional_cpus", setup_additional_cpus);
    early_param("stram_pool", atari_stram_setup);
    early_param("disable_octeon_edac", disable_octeon_edac);
    early_param("rd_start", rd_start_early);
    early_param("rd_size", rd_size_early);
    early_param("coherentio", setcoherentio);
    early_param("nocoherentio", setnocoherentio);
    early_param("fadump", early_fadump_param);
    early_param("fadump_reserve_mem", early_fadump_reserve_mem);
    early_param("no_stf_barrier", handle_no_stf_barrier);
    early_param("no_rfi_flush", handle_no_rfi_flush);
    early_param("smt-enabled", early_smt_enabled);
    early_param("ppc_pci_reset_phbs", pci_reset_phbs_setup);
    early_param("ps3fb", early_parse_ps3fb);
    early_param("ps3flash", early_parse_ps3flash);
    early_param("novx", disable_vector_extension);
    early_param("nobp", nobp_setup_early);
    early_param("nospec", nospec_setup_early);
    early_param("possible_cpus", _setup_possible_cpus);
    early_param("stp", early_parse_stp);
    early_param("nopfault", nopfault);
    early_param("nmi_mode", nmi_mode_setup);
    early_param("sh_mv", early_parse_mv);
    early_param("pmb", early_pmb);
    early_param("hvirq", early_hvirq_major);
    early_param("cfi", cfi_parse_cmdline);
    early_param("disableapic", setup_disableapic);
    early_param("noapictimer", parse_disable_apic_timer);
    early_param("disable_cpu_apicid", apic_set_disabled_cpu_apicid);
    early_param("uv_memblksize", parse_mem_block_size);
    early_param("retbleed", retbleed_parse_cmdline);
    early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
    early_param("update_mptable", update_mptable_setup);
    early_param("alloc_mptable", parse_alloc_mptable_opt);
    early_param("possible_cpus", _setup_possible_cpus);
    early_param("lsmsi", early_parse_ls_scfg_msi);
    early_param("nokgdbroundup", opt_nokgdbroundup);
    early_param("kgdbcon", opt_kgdb_con);
    early_param("kasan", early_kasan_flag);
    early_param("kasan.mode", early_kasan_mode);
    early_param("kasan.vmalloc", early_kasan_flag_vmalloc);
    early_param("kasan.page_alloc.sample", early_kasan_flag_page_alloc_sample);
    early_param("kasan.page_alloc.sample.order", early_kasan_flag_page_alloc_sample_order);
    early_param("kasan.fault", early_kasan_fault);
    early_param("kasan.stacktrace", early_kasan_flag_stacktrace);
    early_param("kasan.stack_ring_size", early_kasan_flag_stack_ring_size);
    early_param("accept_memory", accept_memory_parse);
    early_param("page_table_check", early_page_table_check_param);
    sh_early_platform_init("earlytimer", &sh_cmt_device_driver);
    early_param_on_off("gbpages", "nogbpages", direct_gbpages, CONFIG_X86_DIRECT_GBPAGES);

These are not necessarily bugs, given that some kernel boot parameters are intended for deep debugging rather than general use.

This work does not cover all of the kernel boot parameters declared using cmdline_find_option() and cmdline_find_option_bool(). If these are in fact guaranteed to be early (which appears to be the case), they can be added in a later version of this patch.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Petr Malat <oss@malat.biz>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <linux-doc@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
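[ For reference, an early_param() declaration has the following shape; the parameter here is hypothetical, for illustration only: ]

    static bool my_flag __initdata;

    /* Runs during early command-line parsing, long before any parameters
     * embedded in initrd could be extracted. */
    static int __init parse_my_flag(char *str)
    {
            my_flag = true;
            return 0;
    }
    early_param("my_flag", parse_my_flag);

Such a parameter would then carry the "EARLY" marking (for example, "[KNL,EARLY]") in its kernel-parameters.txt entry.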
2024-02-14doc: Add CONFIG_RCU_STRICT_GRACE_PERIOD to checklist.rstPaul E. McKenney
This commit adds CONFIG_RCU_STRICT_GRACE_PERIOD to the list of debugging Kconfig options in checklist.rst.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14doc: Make checklist.rst note that spinlocks are implied RCU readersPaul E. McKenney
In kernels built with CONFIG_PREEMPT_RT=n, spinlock critical sections are RCU readers because they disable preemption. However, they are also RCU readers in CONFIG_PREEMPT_RT=y because in that case the locking primitives contain rcu_read_lock() and rcu_read_unlock(). Therefore, upgrade checklist.rst to document this non-obvious case.

While in the area, fix a typo by changing "read-side critical" to "read-side critical section".

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14doc: Make whatisRCU.rst note that spinlocks are RCU readersPaul E. McKenney
In kernels built with CONFIG_PREEMPT_RT=n, spinlock critical sections are RCU readers because they disable preemption. However, they are also RCU readers in CONFIG_PREEMPT_RT=y because in that case the locking primitives contain rcu_read_lock() and rcu_read_unlock(). Therefore, upgrade whatisRCU.rst to document this non-obvious case.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14doc: Spinlocks are implied RCU readersPaul E. McKenney
In kernels built with CONFIG_PREEMPT_RT=n, spinlock critical sections are RCU readers because they disable preemption. However, they are also RCU readers in CONFIG_PREEMPT_RT=y because the -rt locking primitives contain rcu_read_lock() and rcu_read_unlock(). Therefore, upgrade rcu_dereference.rst to document this non-obvious case.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/lkml/CAHk-=whGKvjHCtJ6W4pQ0_h_k9fiFQ8V2GpM=BqYnB2X=SJ+XQ@mail.gmail.com/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
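[ Editor's illustration of the documented guarantee; gp, the locks, and do_something_with() are hypothetical: ]

    /* Reader: the spinlock section itself acts as the RCU read-side
     * critical section, on both PREEMPT_RT=n and PREEMPT_RT=y kernels. */
    spin_lock(&mylock);
    p = rcu_dereference_check(gp, lockdep_is_held(&mylock));
    if (p)
            do_something_with(p);   /* p cannot be freed before unlock */
    spin_unlock(&mylock);

    /* Updater: synchronize_rcu() therefore also waits for the spinlock
     * critical section above to complete before the free. */
    old = rcu_replace_pointer(gp, new, lockdep_is_held(&update_lock));
    synchronize_rcu();
    kfree(old);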
2024-02-14rcu/exp: Remove rcu_par_gp_wqFrederic Weisbecker
TREE04 running on short iterations can produce writer stalls of the following kind:

 ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0
 task:rcu_torture_wri state:D stack:14568 pid:83 ppid:2 flags:0x00004000
 Call Trace:
  <TASK>
  __schedule+0x2de/0x850
  ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0
  schedule+0x4f/0x90
  synchronize_rcu_expedited+0x430/0x670
  ? __pfx_autoremove_wake_function+0x10/0x10
  ? __pfx_synchronize_rcu_expedited+0x10/0x10
  do_rtws_sync.constprop.0+0xde/0x230
  rcu_torture_writer+0x4b4/0xcd0
  ? __pfx_rcu_torture_writer+0x10/0x10
  kthread+0xc7/0xf0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x2f/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>

Waiting for an expedited grace period and polling for an expedited grace period both are operations that internally rely on the same workqueue performing necessary asynchronous work.

However, a dependency chain is involved between those two operations, as depicted below:

       ====== CPU 0 =======                         ====== CPU 1 =======

                                                    synchronize_rcu_expedited()
                                                        exp_funnel_lock()
                                                            mutex_lock(&rcu_state.exp_mutex);
   start_poll_synchronize_rcu_expedited
       queue_work(rcu_gp_wq, &rnp->exp_poll_wq);
                                                        synchronize_rcu_expedited_queue_work()
                                                            queue_work(rcu_gp_wq, &rew->rew_work);
                                                        wait_event() // A, wait for &rew->rew_work completion
                                                        mutex_unlock() // B
   //======> switch to kworker

   sync_rcu_do_polled_gp() {
       synchronize_rcu_expedited()
           exp_funnel_lock()
               mutex_lock(&rcu_state.exp_mutex); // C, wait B
               ....
   } // D

Since workqueues are usually implemented on top of several kworkers handling the queue concurrently, the above situation wouldn't deadlock most of the time because A then doesn't depend on D. But in case of memory stress, a single kworker may end up handling alone all the works in a serialized way. In that case the above layout becomes a problem because A then waits for D, closing a circular dependency:

    A -> D -> C -> B -> A

This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed synchronize_rcu_expedited() is otherwise implemented on top of a kthread worker while polling still relies on the rcu_gp_wq workqueue, breaking the above circular dependency chain.

Fix this by making expedited grace periods always rely on kthread workers. The workqueue based implementation is essentially a duplicate anyway now that the per-node initialization is performed by per-node kthread workers.

Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to manage the scheduler policy of these kthread workers.

Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Joel Fernandes <joel@joelfernandes.org>
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
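[ Editor's sketch of the kthread_worker pattern this moves to, simplified; the worker name and handler are assumptions, not the exact patch: ]

    #include <linux/kthread.h>

    static struct kthread_worker *exp_gp_kworker;
    static struct kthread_work exp_gp_work;

    static void exp_gp_work_fn(struct kthread_work *work)
    {
            /* ... expedited grace-period machinery would run here ... */
    }

    static int __init sketch_exp_worker_init(void)
    {
            exp_gp_kworker = kthread_create_worker(0, "rcu_exp_gp_kthread_worker");
            if (IS_ERR(exp_gp_kworker))
                    return PTR_ERR(exp_gp_kworker);
            kthread_init_work(&exp_gp_work, exp_gp_work_fn);
            return 0;
    }

    /* Later, instead of queue_work(rcu_gp_wq, ...), both the waiting and
     * the polling paths would queue to the dedicated worker: */
    kthread_queue_work(exp_gp_kworker, &exp_gp_work);

Because a kthread_worker has its own dedicated task rather than sharing a kworker pool, a memory-stressed pool can no longer serialize A behind D and close the A -> D -> C -> B -> A cycle described above.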
2024-02-14rcu/exp: Handle parallel exp gp kworkers affinityFrederic Weisbecker
Affine the parallel expedited gp kworkers to their respective RCU node in order to make them close to the cache they are playing with.

This reuses the boost kthreads machinery that probes into CPU hotplug operations such that the kthreads become/stay affine to their respective node as soon/long as it contains online CPUs. Otherwise, if the CPU going down was the last one online on the leaf node, the related kthread is affined to the housekeeping CPUs.

In the long run, this affinity vs. CPU-hotplug game should probably be implemented at the generic kthread level.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
[boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
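[ Editor's sketch of the affinity-setting step, heavily simplified; the actual code reuses the boost-kthread helper and RCU-node CPU masks rather than NUMA-node masks: ]

    static void sketch_set_node_affinity(struct kthread_worker *kw, int nid,
                                         int outgoingcpu)
    {
            cpumask_var_t cm;

            if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
                    return;
            cpumask_and(cm, cpumask_of_node(nid), cpu_online_mask);
            if (outgoingcpu >= 0)
                    cpumask_clear_cpu(outgoingcpu, cm); /* CPU on its way down */
            if (cpumask_empty(cm))                      /* node fully offline */
                    cpumask_copy(cm, housekeeping_cpumask(HK_TYPE_RCU));
            set_cpus_allowed_ptr(kw->task, cm);
            free_cpumask_var(cm);
    }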
2024-02-14rcu/exp: Make parallel exp gp kworker per rcu nodeFrederic Weisbecker
When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node initialization is performed in parallel via workqueues (one work per node). However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is performed by a single kworker serializing each node initialization (one work for all nodes).

The second part is certainly less scalable and efficient beyond a single leaf node.

To improve this, expand this single kworker into per-node kworkers. This new layout is eventually intended to remove the workqueues based implementation since it will essentially now become duplicate code.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Move expedited kthread worker creation functions above rcutree_prepare_cpu()Frederic Weisbecker
The expedited kthread worker performing the per node initialization is going to be split into per node kthreads. As such, the future per node kthread creation will need to be called from CPU hotplug callbacks instead of an initcall, right beside the per node boost kthread creation.

To prepare for that, move the kthread worker creation above rcutree_prepare_cpu() as a first step to make the review smoother for the upcoming modifications.

No intended functional change.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu: s/boost_kthread_mutex/kthread_mutexFrederic Weisbecker
This mutex is currently protecting per node boost kthreads creation and affinity setting across CPU hotplug operations.

Since the expedited kworkers will soon be split per node as well, they will be subject to the same concurrency constraints against hotplug. Therefore their creation and affinity tuning operations will be grouped with those of boost kthreads and then rely on the same mutex.

To prepare for that, generalize its name.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Handle RCU expedited grace period kworker allocation failureFrederic Weisbecker
Just like is done for the kworker performing nodes initialization, gracefully handle the possible allocation failure of the RCU expedited grace period main kworker.

While at it perform a rename of the related checking functions to better reflect the expedited specifics.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure recoveryFrederic Weisbecker
Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited grace periods is queued to a kworker. However if the allocation of that kworker failed, the nodes initialization is performed synchronously by the caller instead.

Now the check for kworker initialization failure relies on the kworker pointer to be NULL while its value might actually encapsulate an allocation failure error.

Make sure to handle this case.

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Remove full barrier upon main thread wakeupFrederic Weisbecker
When an expedited grace period is ending, care must be taken so that all the quiescent states propagated up to the root are correctly ordered against the wake up of the main expedited grace period workqueue.

This ordering is already carried through the root rnp locking augmented by an smp_mb__after_unlock_lock() barrier.

Therefore the explicit smp_mb() placed before the wake up is not needed and can be removed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Check rdp_gp->nocb_timer in __call_rcu_nocb_wake()Zqiang
Currently, only rdp_gp->nocb_timer is used. For the nocb_timer of a non-rdp_gp structure, timer_pending() always returns false. This commit therefore checks rdp_gp->nocb_timer in __call_rcu_nocb_wake().

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
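[ Editor's sketch of the shape of the fix, reconstructed from the description; surrounding context elided and not guaranteed to match the literal diff: ]

    -	if (!timer_pending(&rdp->nocb_timer))     /* never armed: always false */
    +	if (!timer_pending(&rdp_gp->nocb_timer))  /* the only timer in use */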
2024-02-14rcu/nocb: Fix WARN_ON_ONCE() in the rcu_nocb_bypass_lock()Zqiang
For the kernels built with CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y and CONFIG_RCU_LAZY=y, the following scenario will trigger WARN_ON_ONCE() in the rcu_nocb_bypass_lock() and rcu_nocb_wait_contended() functions:

    CPU2                                               CPU11
    kthread
    rcu_nocb_cb_kthread                                ksys_write
    rcu_do_batch                                       vfs_write
    rcu_torture_timer_cb                               proc_sys_write
    __kmem_cache_free                                  proc_sys_call_handler
    kmemleak_free                                      drop_caches_sysctl_handler
    delete_object_full                                 drop_slab
    __delete_object                                    shrink_slab
    put_object                                         lazy_rcu_shrink_scan
    call_rcu                                           rcu_nocb_flush_bypass
    __call_rcu_common                                    rcu_nocb_bypass_lock
                                                         raw_spin_trylock(&rdp->nocb_bypass_lock) fail
                                                         atomic_inc(&rdp->nocb_lock_contended);
    rcu_nocb_wait_contended
    WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
    WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended))
                              |
                              |____ same rdp and rdp->cpu != 11 ____|

Reproduce this bug with "echo 3 > /proc/sys/vm/drop_caches".

This commit therefore uses rcu_nocb_try_flush_bypass() instead of rcu_nocb_flush_bypass() in lazy_rcu_shrink_scan(). If the nocb_bypass queue is being flushed, then rcu_nocb_try_flush_bypass() will return directly.

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Re-arrange call_rcu() NOCB specific codeFrederic Weisbecker
Currently the call_rcu() function interleaves NOCB and !NOCB enqueue code in a complicated way such that:

* The bypass enqueue code may or may not have enqueued and may or may not have locked the ->nocb_lock. Everything that follows is in a Schrödinger locking state for the unwary reviewer's eyes.

* The was_alldone is always set but only used in NOCB related code.

* The NOCB wake up is distantly related to the locking hopefully performed by the bypass enqueue code that did not enqueue on the bypass list.

Unconfuse the whole and gather NOCB and !NOCB specific enqueue code to their own functions.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Make IRQs disablement symmetricFrederic Weisbecker
Currently IRQs are disabled on call_rcu() and then depending on the context:

* If the CPU is in nocb mode:

  - If the callback is enqueued in the bypass list, IRQs are re-enabled implicitly by rcu_nocb_try_bypass()

  - If the callback is enqueued in the normal list, IRQs are re-enabled implicitly by __call_rcu_nocb_wake()

* If the CPU is NOT in nocb mode, IRQs are re-enabled explicitly from call_rcu()

This makes the code a bit hard to follow, especially as it interleaves with nocb locking.

To make the IRQ flags coverage clearer and also in order to prepare for moving all the nocb enqueue code to its own function, always re-enable the IRQ flags explicitly from call_rcu().

Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
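[ Editor's sketch of the symmetric pattern this moves toward; the helper names are hypothetical: ]

    static void sketch_call_rcu(struct rcu_head *head, rcu_callback_t func)
    {
            unsigned long flags;

            local_irq_save(flags);
            /* Neither helper re-enables IRQs behind the caller's back... */
            if (!sketch_nocb_enqueue(head, func))       /* nocb + bypass cases */
                    sketch_regular_enqueue(head, func); /* !nocb case */
            /* ...so there is exactly one clearly visible re-enable point. */
            local_irq_restore(flags);
    }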
2024-02-14rcu/nocb: Remove needless full barrier after callback advancingFrederic Weisbecker
A full barrier is issued from nocb_gp_wait() upon callbacks advancing to order grace period completion with callbacks execution.

However these two events are already ordered by the smp_mb__after_unlock_lock() barrier within the call to raw_spin_lock_rcu_node() that is necessary for callbacks advancing to happen.

The following litmus test shows the kind of guarantee that this barrier provides:

    C smp_mb__after_unlock_lock

    {}

    // rcu_gp_cleanup()
    P0(spinlock_t *rnp_lock, int *gpnum)
    {
            // Grace period cleanup increase gp sequence number
            spin_lock(rnp_lock);
            WRITE_ONCE(*gpnum, 1);
            spin_unlock(rnp_lock);
    }

    // nocb_gp_wait()
    P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int *cb_ready)
    {
            int r1;

            // Call rcu_advance_cbs() from nocb_gp_wait()
            spin_lock(nocb_lock);
            spin_lock(rnp_lock);
            smp_mb__after_unlock_lock();
            r1 = READ_ONCE(*gpnum);
            WRITE_ONCE(*cb_ready, 1);
            spin_unlock(rnp_lock);
            spin_unlock(nocb_lock);
    }

    // nocb_cb_wait()
    P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed)
    {
            int r2;

            // rcu_do_batch() -> rcu_segcblist_extract_done_cbs()
            spin_lock(nocb_lock);
            r2 = READ_ONCE(*cb_ready);
            spin_unlock(nocb_lock);

            // Actual callback execution
            WRITE_ONCE(*cb_executed, 1);
    }

    P3(int *cb_executed, int *gpnum)
    {
            int r3;

            WRITE_ONCE(*cb_executed, 2);
            smp_mb();
            r3 = READ_ONCE(*gpnum);
    }

    exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *)

Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is removed. This barrier orders the grace period completion against callbacks advancing and even later callbacks invocation, thanks to the opportunistic propagation via the ->nocb_lock to nocb_cb_wait().

Therefore the smp_mb() placed after callbacks advancing can be safely removed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
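[ Editor's note: litmus tests of this kind can be checked against the Linux-kernel memory model with the herd7 tool, run from tools/memory-model in the kernel tree, along these lines (the file name here is hypothetical): ]

    $ herd7 -conf linux-kernel.cfg after-unlock-lock.litmus

herd7 then reports whether the "exists" clause, the bad outcome above, is reachable under the model.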
2024-02-14rcu/nocb: Remove needless LOAD-ACQUIREFrederic Weisbecker
The LOAD-ACQUIRE access performed on rdp->nocb_cb_sleep advertises ordering of callback execution against grace period completion. However this is contradicted by the following:

* This LOAD-ACQUIRE doesn't pair with anything. The only counterpart barrier that can be found is the smp_mb() placed after callbacks advancing in nocb_gp_wait(). However that barrier is placed _after_ the ->nocb_cb_sleep write.

* Callbacks can be concurrently advanced between the LOAD-ACQUIRE on ->nocb_cb_sleep and the call to rcu_segcblist_extract_done_cbs() in rcu_do_batch(), making any ordering based on ->nocb_cb_sleep broken.

* Both rcu_segcblist_extract_done_cbs() and rcu_advance_cbs() are called under the nocb_lock, the latter hereby already providing the desired ACQUIRE semantics.

Therefore it is safe to access ->nocb_cb_sleep with a simple compiler barrier.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
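[ Editor's sketch of the shape of such a change; illustrative, not the literal diff: ]

    -	if (smp_load_acquire(&rdp->nocb_cb_sleep))  /* unpaired ACQUIRE */
    +	if (READ_ONCE(rdp->nocb_cb_sleep))          /* ->nocb_lock already orders */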
2024-01-28Linux 6.8-rc2v6.8-rc2Linus Torvalds
2024-01-28Merge tag 'cxl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxlLinus Torvalds
Pull cxl fixes from Dan Williams:
 "A build regression fix, a device compatibility fix, and an original
  bug preventing creation of large (16 device) interleave sets:

  - Fix unit test build regression fallout from global
    "missing-prototypes" change

  - Fix compatibility with devices that do not support interrupts

  - Fix overflow when calculating the capacity of large interleave sets"

* tag 'cxl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/region:Fix overflow issue in alloc_hpa()
  cxl/pci: Skip irq features if MSI/MSI-X are not supported
  tools/testing/nvdimm: Disable "missing prototypes / declarations" warnings
  tools/testing/cxl: Disable "missing prototypes / declarations" warnings
2024-01-28Merge tag 'mips-fixes_6.8_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linuxLinus Torvalds
Pull MIPS fixes from Thomas Bogendoerfer:

 - fix boot issue on single core Lantiq Danube devices

 - fix boot issue on Loongson64 platforms

 - fix improper FPU setup

 - fix missing prototypes issues

* tag 'mips-fixes_6.8_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  mips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nan
  MIPS: loongson64: set nid for reserved memblock region
  Revert "MIPS: loongson64: set nid for reserved memblock region"
  MIPS: lantiq: register smp_ops on non-smp platforms
  MIPS: loongson64: set nid for reserved memblock region
  MIPS: reserve exception vector space ONLY ONCE
  MIPS: BCM63XX: Fix missing prototypes
  MIPS: sgi-ip32: Fix missing prototypes
  MIPS: sgi-ip30: Fix missing prototypes
  MIPS: fw arc: Fix missing prototypes
  MIPS: sgi-ip27: Fix missing prototypes
  MIPS: Alchemy: Fix missing prototypes
  MIPS: Cobalt: Fix missing prototypes
2024-01-28Merge tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull locking fix from Borislav Petkov:

 - Prevent an inconsistent futex operation leading to stale state
   exposure

* tag 'locking_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Prevent the reuse of stale pi_state
2024-01-28Merge tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull irq fix from Borislav Petkov:

 - Initialize the resend node of each IRQ descriptor, not only the
   first one

* tag 'irq_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Initialize resend_node hlist for all interrupt descriptors
2024-01-28Merge tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull timer fixes from Borislav Petkov:

 - Preserve the number of idle calls and sleep entries across CPU
   hotplug events in order to be able to compute correct averages

 - Limit the duration of the clocksource watchdog checking interval as
   too long intervals lead to wrongly marking the TSC as unstable

* tag 'timers_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick/sched: Preserve number of idle sleeps across CPU hotplug events
  clocksource: Skip watchdog check for large watchdog intervals
2024-01-28Merge tag 'x86_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull x86 fixes from Borislav Petkov:

 - Make sure 32-bit syscall registers are properly sign-extended

 - Add detection for AMD's Zen5 generation CPUs and Intel's Clearwater
   Forest CPU model number

 - Make a stub function export non-GPL because it is part of the
   paravirt alternatives and that can be used by non-GPL code

* tag 'x86_urgent_for_v6.8_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/CPU/AMD: Add more models to X86_FEATURE_ZEN5
  x86/entry/ia32: Ensure s32 is sign extended to s64
  x86/cpu: Add model number for Intel Clearwater Forest processor
  x86/CPU/AMD: Add X86_FEATURE_ZEN5
  x86/paravirt: Make BUG_func() usable by non-GPL modules
2024-01-28Merge tag 'fixes-2024-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblockLinus Torvalds
Pull memblock fix from Mike Rapoport:
 "Fix crash when reserved memory is not added to memory.

  When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the initialization
  of reserved pages may cause access of NODE_DATA() with invalid nid
  and crash.

  Add a fall back to early_pfn_to_nid() in memmap_init_reserved_pages()
  to ensure a valid node id is always passed to init_reserved_page()"

* tag 'fixes-2024-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: fix crash when reserved memory is not added to memory
2024-01-27Merge tag 'platform-drivers-x86-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86Linus Torvalds
Pull x86 platform driver fixes from Hans de Goede:

 - WMI bus driver fixes

 - Second attempt (previously reverted) at P2SB PCI rescan deadlock fix

 - AMD PMF driver improvements

 - MAINTAINERS updates

 - Misc other small fixes and hw-id additions

* tag 'platform-drivers-x86-v6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: touchscreen_dmi: Add info for the TECLAST X16 Plus tablet
  platform/x86/intel/ifs: Call release_firmware() when handling errors.
  platform/x86/amd/pmf: Fix memory leak in amd_pmf_get_pb_data()
  platform/x86/amd/pmf: Get ambient light information from AMD SFH driver
  platform/x86/amd/pmf: Get Human presence information from AMD SFH driver
  platform/mellanox: mlxbf-pmc: Fix offset calculation for crspace events
  platform/mellanox: mlxbf-tmfifo: Drop Tx network packet when Tx TmFIFO is full
  MAINTAINERS: remove defunct acpi4asus project info from asus notebooks section
  MAINTAINERS: add Luke Jones as maintainer for asus notebooks
  MAINTAINERS: Remove Perry Yuan as DELL WMI HARDWARE PRIVACY SUPPORT maintainer
  platform/x86: silicom-platform: Add missing "Description:" for power_cycle sysfs attr
  platform/x86: intel-wmi-sbl-fw-update: Fix function name in error message
  platform/x86: p2sb: Use pci_resource_n() in p2sb_read_bar0()
  platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe
  platform/x86: intel-uncore-freq: Fix types in sysfs callbacks
  platform/x86: wmi: Fix wmi_dev_probe()
  platform/x86: wmi: Fix notify callback locking
  platform/x86: wmi: Decouple legacy WMI notify handlers from wmi_block_list
  platform/x86: wmi: Return immediately if an suitable WMI event is found
  platform/x86: wmi: Fix error handling in legacy WMI notify handler functions
2024-01-27Merge tag 'loongarch-fixes-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongsonLinus Torvalds
Pull LoongArch fixes from Huacai Chen:
 "Fix boot failure on machines with more than 8 nodes, and fix two
  build errors about KVM"

* tag 'loongarch-fixes-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Add returns to SIMD stubs
  LoongArch: KVM: Fix build due to API changes
  LoongArch/smp: Call rcutree_report_cpu_starting() at tlb_init()
2024-01-27Merge tag 'xfs-6.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Chandan Babu:

 - Fix read only mounts when using fsopen mount API

* tag 'xfs-6.8-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: read only mounts with fsopen mount API are busted
2024-01-27Merge tag 'bcachefs-2024-01-26' of https://evilpiepirate.org/git/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet:

 - fix for REQ_OP_FLUSH usage; this fixes filesystems going read only
   with -EOPNOTSUPP from the block layer

   (this really should have gone in with the block layer patch causing
   the -EOPNOTSUPP, or should have gone in before)

 - fix an allocation in non-sleepable context

 - fix one source of srcu lock latency, on devices with terrible
   discard latency

 - fix a reattach_inode() issue in fsck

* tag 'bcachefs-2024-01-26' of https://evilpiepirate.org/git/bcachefs:
  bcachefs: __lookup_dirent() works in snapshot, not subvol
  bcachefs: discard path uses unlock_long()
  bcachefs: fix incorrect usage of REQ_OP_FLUSH
  bcachefs: Add gfp flags param to bch2_prt_task_backtrace()
2024-01-27Merge tag '6.8-rc2-smb3-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French:

 - Fix netlink OOB

 - Minor kernel doc fix

* tag '6.8-rc2-smb3-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix global oob in ksmbd_nl_policy
  smb: Fix some kernel-doc comments
2024-01-27Merge tag '6.8-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French:
 "Nine cifs/smb client fixes:

  - Four network error fixes (three relating to replays of requests
    that need to be retried, and one fixing some places where we were
    returning the wrong rc up the stack on network errors)

  - Two multichannel fixes including locking fix and case where subset
    of channels need reconnect

  - netfs integration fixup: share remote i_size with netfslib

  - Two small cleanups (one for addressing a clang warning)"

* tag '6.8-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix stray unlock in cifs_chan_skip_or_disable
  cifs: set replay flag for retries of write command
  cifs: commands that are retried should have replay flag set
  cifs: helper function to check replayable error codes
  cifs: translate network errors on send to -ECONNABORTED
  cifs: cifs_pick_channel should try selecting active channels
  cifs: Share server EOF pos with netfslib
  smb: Work around Clang __bdos() type confusion
  smb: client: delete "true", "false" defines
2024-01-27mips: Call lose_fpu(0) before initializing fcr31 in mips_set_personality_nanXi Ruoyao
If we still own the FPU after initializing fcr31, when we are preempted the dirty value in the FPU will be read out and stored into fcr31, clobbering our setting. This can cause an improper floating-point environment after execve(). For example:

    zsh% cat measure.c
    #include <fenv.h>
    int main() { return fetestexcept(FE_INEXACT); }
    zsh% cc measure.c -o measure -lm
    zsh% echo $((1.0/3)) # raising FE_INEXACT
    0.33333333333333331
    zsh% while ./measure; do ; done
    (stopped in seconds)

Call lose_fpu(0) before setting fcr31 to prevent this.

Closes: https://lore.kernel.org/linux-mips/7a6aa1bbdbbe2e63ae96ff163fab0349f58f1b9e.camel@xry111.site/
Fixes: 9b26616c8d9d ("MIPS: Respect the ISA level in FCSR handling")
Cc: stable@vger.kernel.org
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>