path: root/kernel
2024-07-29  sched/fair: Remove cfs_rq::nr_spread_over and cfs_rq::exec_clock  (Chuyi Zhou)
nr_spread_over tracks the number of instances where the difference between a scheduling entity's virtual runtime and the minimum virtual runtime in the runqueue exceeds three times the scheduler latency, indicating significant disparity in task scheduling. Commit that removed its usage: 5e963f2bd ("sched/fair: Commit to EEVDF"). cfs_rq->exec_clock was used to account for time spent executing tasks. Commit that removed its usage: 5d69eca542ee1 ("sched: Unify runtime accounting across classes"). cfs_rq::nr_spread_over and cfs_rq::exec_clock are no longer used under EEVDF, so remove them from struct cfs_rq. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Vishal Chourasia <vishalc@linux.ibm.com> Link: https://lore.kernel.org/r/20240717143342.593262-1-zhouchuyi@bytedance.com
2024-07-29  sched/core: Add WARN_ON_ONCE() to check overflow for migrate_disable()  (Peilin He)
Background
==========
When repeated migrate_disable() calls are made without the corresponding migrate_enable() calls, there is a risk of 'migration_disabled' overflowing, because 'migration_disabled' is an unsigned short whose maximum value is 65535. In a PREEMPT_RT kernel, if 'migration_disabled' overflows, migrate_disable() may become ineffective within local_lock_irqsave(): during the scheduling procedure the value of 'migration_disabled' is checked, and CPU migration can then be triggered. Consequently, the 'rcu_read_lock_nesting' count may leak because local_lock_irqsave() and local_unlock_irqrestore() end up running on different CPUs.
Usecase
=======
For example, while developing a driver I encountered a warning like "WARNING: CPU: 4 PID: 260 at kernel/rcu/tree_plugin.h:315 rcu_note_context_switch+0xa8/0x4e8". It took me half a month to locate this issue. Ultimately, I discovered that the lack of an overflow detection mechanism in migrate_disable() was the root cause, leading to a significant amount of time spent on problem localization. Had an overflow check been present in migrate_disable(), the root cause could have been identified quickly and easily.
Effect
======
Using WARN_ON_ONCE() to check whether 'migration_disabled' overflows helps developers identify the issue quickly.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Peilin He <he.peilin@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn> Reviewed-by: Qiang Tu <tu.qiang35@zte.com.cn> Reviewed-by: Kun Jiang <jiang.kun2@zte.com.cn> Reviewed-by: Fan Yu <fan.yu9@zte.com.cn> Link: https://lkml.kernel.org/r/20240716104244764N2jD8gnBpnsLjCDnQGQ8c@zte.com.cn
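The wrap-around itself is easy to demonstrate in isolation. A runnable userspace sketch (USHRT_MAX stands in for the 65535 limit mentioned above; the check mirrors the idea of the patch, not its actual diff):
```c
#include <stdio.h>
#include <limits.h>

int main(void)
{
	unsigned short migration_disabled = USHRT_MAX;	/* 65535: deeply unbalanced */

	/* The commit's idea: warn once before the increment wraps the counter
	 * to 0, which would silently make migrate_disable() ineffective. */
	if (migration_disabled == USHRT_MAX)
		fprintf(stderr, "WARN: migration_disabled about to overflow\n");

	migration_disabled++;
	printf("migration_disabled = %u\n", migration_disabled);	/* prints 0 */
	return 0;
}
```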
2024-07-29  sched: Initialize the vruntime of a new task when it is first enqueued  (Zhang Qiao)
When creating a new task, we initialize the vruntime of the newly created task at sched_cgroup_fork(). However, this action is executed too early and may not be accurate, because it uses the current CPU to initialize the vruntime, while the new task actually runs on the CPU assigned at wake_up_new_task(). To optimize this case, pass the ENQUEUE_INITIAL flag to activate_task() in wake_up_new_task(); this way, when place_entity() is called in enqueue_entity(), the vruntime of the new task is initialized there. In addition, place_entity() in task_fork_fair() was originally introduced for two reasons: 1. Previously, __enqueue_entity() was in task_new_fair(); in order to provide a vruntime for enqueueing the newly created task, the vruntime assignment equation "se->vruntime = cfs_rq->min_vruntime" was introduced by commit e9acbff6484d ("sched: introduce se->vruntime"). This is the initial state of place_entity(). 2. Commit 4d78e7b656aa ("sched: new task placement for vruntime") added the child_runs_first task placement feature, which is based on vruntime and therefore also requires the new task's vruntime value. After the removal of child_runs_first and of enqueue_entity() from task_fork_fair(), this place_entity() call no longer makes sense, so remove it as well. Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20240627133359.1370598-1-zhangqiao22@huawei.com
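A hedged sketch of the resulting flow (the flag and function names appear in the commit text; the body below is abbreviated and illustrative, not the actual kernel diff):
```c
/* Illustrative sketch only: wake_up_new_task() requests initial placement,
 * so place_entity() computes the vruntime on the CPU the task will
 * actually run on, instead of the CPU that forked it. */
void wake_up_new_task(struct task_struct *p)
{
	struct rq *rq;

	/* ... select the target CPU and lock its runqueue ... */
	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_INITIAL);
	/* enqueue_entity() -> place_entity() initializes se->vruntime here. */
	/* ... */
}
```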
2024-07-29  sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()  (Yang Yingliang)
If cpuset_cpu_inactive() fails, set_rq_online() needs to be called to roll back. Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
2024-07-29  sched/core: Introduce sched_set_rq_on/offline() helper  (Yang Yingliang)
Introduce the sched_set_rq_online/offline() helpers so they can simply be called from both the normal and error paths. No functional change. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
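Taken together with the fix above, the resulting pattern looks roughly like this sketch (the helper bodies and surrounding error handling are abbreviated assumptions, not the verified kernel code):
```c
/* Illustrative sketch of the rollback pattern from the two commits above. */
int sched_cpu_deactivate(unsigned int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	int ret;

	/* ... */
	sched_set_rq_offline(rq, cpu);	/* helper wrapping set_rq_offline() */

	ret = cpuset_cpu_inactive(cpu);
	if (ret) {
		sched_set_rq_online(rq, cpu);	/* roll back on failure */
		/* ... undo the remaining deactivation steps ... */
		return ret;
	}
	/* ... */
	return 0;
}
```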
2024-07-29  sched/smt: Fix unbalance sched_smt_present dec/inc  (Yang Yingliang)
I got the following warning report while doing a stress test:
  jump label: negative count!
  WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
  Call Trace:
  <TASK>
  __static_key_slow_dec_cpuslocked+0x16/0x70
  sched_cpu_deactivate+0x26e/0x2a0
  cpuhp_invoke_callback+0x3ad/0x10d0
  cpuhp_thread_fun+0x3f5/0x680
  smpboot_thread_fn+0x56d/0x8d0
  kthread+0x309/0x400
  ret_from_fork+0x41/0x70
  ret_from_fork_asm+0x1b/0x30
  </TASK>
When cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the CPU offline fails, but sched_smt_present has already been decremented earlier in sched_cpu_deactivate(), before cpuset_cpu_inactive() is called. This leads to an unbalanced dec/inc; fix it by incrementing sched_smt_present in the error path. Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology") Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
2024-07-29  sched/smt: Introduce sched_smt_present_inc/dec() helper  (Yang Yingliang)
Introduce the sched_smt_present_inc/dec() helpers so they can simply be called from both the normal and error paths. No functional change. Cc: stable@kernel.org Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
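A hedged sketch of what such a helper pair plausibly looks like (the names come from the commit; the bodies are an assumption based on how sched_smt_present is maintained, not the verified patch):
```c
/* Illustrative sketch: factor the SMT static-branch accounting into a
 * helper pair, so the error path of sched_cpu_deactivate() can simply
 * call the inc variant to rebalance the count. */
static void sched_smt_present_inc(int cpu)
{
#ifdef CONFIG_SCHED_SMT
	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
		static_branch_inc_cpuslocked(&sched_smt_present);
#endif
}

static void sched_smt_present_dec(int cpu)
{
#ifdef CONFIG_SCHED_SMT
	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
		static_branch_dec_cpuslocked(&sched_smt_present);
#endif
}
```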
2024-07-29  sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime  (Zheng Zucheng)
In extreme test scenarios, the 14th field (utime) in /proc/xx/stat can be greater than sum_exec_runtime: utime = 18446744073709518790 ns, rtime = 135989749728000 ns. In the cputime_adjust() process, stime becomes greater than rtime due to a mul_u64_u64_div_u64() precision problem: before calling mul_u64_u64_div_u64(), stime = 175136586720000, rtime = 135989749728000, utime = 1416780000; after calling mul_u64_u64_div_u64(), stime = 135989949653530. An unsigned wraparound occurs because rtime is less than stime: utime = rtime - stime = 135989749728000 - 135989949653530 = -199925530 = (u64)18446744073709518790. Trigger conditions: 1) the user task runs in kernel mode most of the time; 2) ARM64 architecture; 3) TICK_CPU_ACCOUNTING=y and CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set. Fix the mul_u64_u64_div_u64() conversion precision problem by resetting stime to rtime. Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
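A runnable userspace sketch of the guard described above (mul_u64_u64_div_u64() is stood in here with exact 128-bit arithmetic; in the kernel the precision loss comes from that helper's approximation on some configurations, which is exactly what the clamp protects against):
```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;

/* Userspace stand-in for the kernel helper: a * b / c. */
static u64 mul_u64_u64_div_u64(u64 a, u64 b, u64 c)
{
	return (u64)(((unsigned __int128)a * b) / c);
}

int main(void)
{
	u64 stime = 175136586720000ULL;
	u64 rtime = 135989749728000ULL;
	u64 utime = 1416780000ULL;

	/* Charge stime/(stime + utime) of rtime to system time. */
	stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);

	if (stime > rtime)	/* the clamp added by the fix */
		stime = rtime;

	utime = rtime - stime;	/* can no longer wrap below zero */
	printf("stime=%llu utime=%llu\n",
	       (unsigned long long)stime, (unsigned long long)utime);
	return 0;
}
```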
2024-07-29  locking/pvqspinlock: Correct the type of "old" variable in pv_kick_node()  (Uros Bizjak)
"enum vcpu_state" is not compatible with "u8" type for all targets, resulting in: error: initialization of 'u8 *' {aka 'unsigned char *'} from incompatible pointer type 'enum vcpu_state *' for LoongArch. Correct the type of "old" variable to "u8". Fixes: fea0e1820b51 ("locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h") Closes: https://lore.kernel.org/lkml/20240719024010.3296488-1-maobibo@loongson.cn/ Reported-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20240721164552.50175-1-ubizjak@gmail.com
2024-07-29  Merge drm/drm-next into drm-misc-next  (Thomas Zimmermann)
Backmerging to get a late RC of v6.10 before moving into v6.11. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2024-07-29  locking/csd_lock: Print large numbers as negatives  (Paul E. McKenney)
The CSD-lock-hold diagnostics from CONFIG_CSD_LOCK_WAIT_DEBUG are printed in nanoseconds as unsigned long longs, which is a bit obtuse for human readers when timing bugs result in negative CSD-lock hold times. Yes, there are some people to whom it is immediately obvious that 18446744073709551615 is really -1, but for the rest of us... Therefore, print these numbers as signed long longs, making the negative hold times immediately apparent. Reported-by: Rik van Riel <riel@surriel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Leonardo Bras <leobras@redhat.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
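The effect is easy to see in plain C; this runnable sketch prints the same 64-bit bit pattern both ways:
```c
#include <stdio.h>

int main(void)
{
	unsigned long long held = 18446744073709551615ULL;	/* a "negative" hold time */

	printf("%llu\n", held);			/* 18446744073709551615 */
	printf("%lld\n", (long long)held);	/* -1: immediately apparent */
	return 0;
}
```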
2024-07-29  rcu/kfree: Warn on unexpected tail state  (Paul E. McKenney)
Within the rcu_sr_normal_gp_cleanup_work() function, there is an acquire load from rcu_state.srs_done_tail, which is expected to be non-NULL. This commit adds a WARN_ON_ONCE() to check this expectation. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
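A minimal sketch of the kind of check described (variable name and control flow are assumptions; the commit only states that the result of the acquire load is now checked with WARN_ON_ONCE()):
```c
/* Illustrative sketch: the acquire load pairs with the release store to
 * rcu_state.srs_done_tail; a NULL result would violate the documented
 * expectation, so warn once instead of dereferencing it blindly. */
struct llist_node *done;

done = smp_load_acquire(&rcu_state.srs_done_tail);
if (WARN_ON_ONCE(!done))
	return;
```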
2024-07-29  rcutorture: Make rcu_torture_write_types() print number of update types  (Paul E. McKenney)
This commit follows the list of update types with their count, resulting in console output like this:
  rcu_torture_write_types: Testing conditional GPs.
  rcu_torture_write_types: Testing conditional expedited GPs.
  rcu_torture_write_types: Testing conditional full-state GPs.
  rcu_torture_write_types: Testing expedited GPs.
  rcu_torture_write_types: Testing asynchronous GPs.
  rcu_torture_write_types: Testing polling GPs.
  rcu_torture_write_types: Testing polling full-state GPs.
  rcu_torture_write_types: Testing polling expedited GPs.
  rcu_torture_write_types: Testing polling full-state expedited GPs.
  rcu_torture_write_types: Testing normal GPs.
  rcu_torture_write_types: Testing 10 update types
This commit adds the final line giving the count. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcutorture: Generic test for NUM_ACTIVE_*RCU_POLL*  (Paul E. McKenney)
The rcutorture test suite has specific tests for both of the NUM_ACTIVE_RCU_POLL_OLDSTATE and NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE macros provided for RCU polled grace periods. However, with the advent of NUM_ACTIVE_SRCU_POLL_OLDSTATE, a more generic test is needed. This commit therefore adds ->poll_active and ->poll_active_full fields to the rcu_torture_ops structure and converts the existing specific tests to use these fields, when present. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcutorture: Add SRCU ->same_gp_state and ->get_comp_state functions  (Paul E. McKenney)
This commit points the SRCU ->same_gp_state and ->get_comp_state fields to same_state_synchronize_srcu() and get_completed_synchronize_srcu(), allowing them to be tested. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcutorture: Remove redundant rcu_torture_ops get_gp_completed fields  (Paul E. McKenney)
The rcu_torture_ops structure's ->get_gp_completed and ->get_gp_completed_full fields are redundant with its ->get_comp_state and ->get_comp_state_full fields. This commit therefore removes the former in favor of the latter. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Remove SEGCBLIST_RCU_CORE  (Frederic Weisbecker)
RCU core can't be running anymore while in the middle of (de-)offloading since this sort of transition now only applies to offline CPUs. The SEGCBLIST_RCU_CORE state can therefore be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Remove halfway (de-)offloading handling from rcu_core  (Frederic Weisbecker)
RCU core can't be running anymore while in the middle of (de-)offloading since this sort of transition now only applies to offline CPUs. The locked callback acceleration handling during the transition can therefore be removed, along with concurrent batch execution. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Remove halfway (de-)offloading handling from rcu_core()'s QS reporting  (Frederic Weisbecker)
RCU core can't be running anymore while in the middle of (de-)offloading since this sort of transition now only applies to offline CPUs. The locked callback acceleration handling during the transition can therefore be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Remove halfway (de-)offloading handling from bypass  (Frederic Weisbecker)
Bypass enqueue can't happen anymore in the middle of (de-)offloading since this sort of transition now only applies to offline CPUs. The related safety check can therefore be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: (De-)offload callbacks on offline CPUs only  (Frederic Weisbecker)
Currently callbacks can be (de-)offloaded only on online CPUs. This involves an overly elaborate state machine in order to make sure that callbacks are always handled during the process while ensuring synchronization between rcu_core and NOCB kthreads. The only potential user of NOCB (de-)offloading appears to be a nohz_full toggling interface through cpusets, and the general agreement is now to work toward toggling the nohz_full state on offline CPUs to simplify the whole picture. Therefore, convert the (de-)offloading to only support offline CPUs. This involves the following changes:
* Call rcu_barrier() before deoffloading. An offline offloaded CPU may still carry callbacks in its queue ignored by rcutree_migrate_callbacks(). Those callbacks must all be flushed before switching to a regular queue because no more kthreads will handle them before the CPU ever gets re-onlined. This means that further calls to rcu_barrier() will find an empty queue until the CPU goes through rcutree_report_cpu_starting(). As a result it is guaranteed that further rcu_barrier() won't try to lock the nocb_lock for that target and thus won't risk an imbalance. Therefore barrier_mutex doesn't need to be locked anymore upon deoffloading.
* Assume the queue is empty before offloading, as rcutree_migrate_callbacks() took care of everything. This means that further calls to rcu_barrier() will find an empty queue until the CPU goes through rcutree_report_cpu_starting(). As a result it is guaranteed that further rcu_barrier() won't risk a nocb_lock imbalance. Therefore barrier_mutex doesn't need to be locked anymore upon offloading.
* No need to flush the bypass anymore.
Further simplifications will follow in upcoming patches. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Introduce nocb mutex  (Frederic Weisbecker)
The barrier_mutex is currently used to protect (de-)offloading operations and to prevent nocb_lock locking imbalance in rcu_barrier() and the shrinker, as well as misordered RCU barrier invocation. Since RCU (de-)offloading is going to happen on offline CPUs, an RCU barrier will have to be executed while transitioning from the offloaded to the de-offloaded state, and this can't happen while holding the barrier_mutex. Introduce a NOCB mutex to protect (de-)offloading transitions. The barrier_mutex is still held for now when necessary to avoid barrier callback reordering and nocb_lock imbalance. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Assert no callbacks while nocb kthread allocation fails  (Frederic Weisbecker)
When a NOCB CPU fails to create a nocb kthread on bringup, the CPU is then deoffloaded. The barrier mutex is locked at this stage. It is typically used to protect against concurrent (de-)offloading and/or concurrent rcu_barrier() invocations that would otherwise risk a nocb locking imbalance. However:
* rcu_barrier() can't run concurrently if this is the boot CPU on early boot-up.
* rcu_barrier() can run concurrently if this is a secondary CPU, but it is expected to see 0 callbacks on this target because it's the first time it boots.
* (de-)offloading can't happen concurrently with smp_init(), as rcutorture is initialized later, at least not before device_initcall(), and userspace isn't available yet.
* (de-)offloading can't happen concurrently with cpu_up(), courtesy of cpu_hotplug_lock.
But:
* The lazy shrinker might run concurrently with cpu_up(). It shouldn't try to grab the nocb_lock and risk an imbalance, because lazy_len is supposed to be 0, but be extra cautious.
* Also be cautious against potential subtleties of resume from hibernation.
So keep the locking and add some assertions and comments. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Move nocb field at the end of state struct  (Frederic Weisbecker)
nocb_is_setup is a rarely used field, mostly on boot and CPU hotplug. It shouldn't occupy the middle of the rcu state hot fields cacheline. Move it to the end and build it conditionally while at it. More cold NOCB fields are to come. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Introduce RCU_NOCB_LOCKDEP_WARN()  (Frederic Weisbecker)
Checking for races against concurrent (de-)offloading implies the creation of !CONFIG_RCU_NOCB_CPU stubs to check if each relevant lock is held. For now this only implies the nocb_lock but more are to be expected. Create instead a NOCB specific version of RCU_LOCKDEP_WARN() to avoid the proliferation of stubs. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
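A hedged sketch of how such a wrapper is typically structured (the macro name is from the commit; the stub pattern below is an assumption):
```c
/* Illustrative sketch: one NOCB-specific lockdep assertion instead of a
 * stub per lock; it compiles away when CONFIG_RCU_NOCB_CPU is off. */
#ifdef CONFIG_RCU_NOCB_CPU
#define RCU_NOCB_LOCKDEP_WARN(c, s)	RCU_LOCKDEP_WARN(c, s)
#else
#define RCU_NOCB_LOCKDEP_WARN(c, s)	do { } while (0)
#endif
```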
2024-07-29  context_tracking, rcu: Rename ct_dynticks_cpu_acquire() into ct_rcu_watching_cpu_acquire()  (Valentin Schneider)
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING; reflect that change in the related helpers. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  context_tracking, rcu: Rename ct_dynticks_cpu() into ct_rcu_watching_cpu()  (Valentin Schneider)
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING; reflect that change in the related helpers. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  context_tracking, rcu: Rename ct_dynticks() into ct_rcu_watching()  (Valentin Schneider)
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING; reflect that change in the related helpers. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  context_tracking, rcu: Rename RCU_DYNTICKS_IDX into CT_RCU_WATCHING  (Valentin Schneider)
The symbols relating to the CT_STATE part of context_tracking.state are now all prefixed with CT_STATE. The RCU dynticks counter part of that atomic variable still involves symbols with different prefixes; align them all to be prefixed with CT_RCU_WATCHING. Suggested-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  treewide: context_tracking: Rename CONTEXT_* into CT_STATE_*  (Valentin Schneider)
Context tracking state related symbols currently use a mix of the CONTEXT_ (e.g. CONTEXT_KERNEL) and CT_STATE_ (e.g. CT_STATE_MASK) prefixes. Clean up the naming and make the ctx_state enum use the CT_STATE_ prefix. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-28  minmax: make generic MIN() and MAX() macros available everywhere  (Linus Torvalds)
This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros. These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches:
- trivial duplicated macro definitions have been removed. Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically.
- non-trivial duplicated macro definitions are guarded with #ifndef. This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case.
- strange use case #1: a couple of drivers decided that the way they want to describe their versioning is with "#define MAJ 1", "#define MIN 2" and "#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)", which adds zero value and I just did my Alexander the Great impersonation, and rewrote that pointless Gordian knot as "#define DRV_VERSION "1.2"" instead.
- strange use case #2: a couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were re-written as C enums instead. The new function-like macros only expand when followed by an open parenthesis, and thus don't clash with enum use.
Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone.
Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
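For reference, the "very traditional semantics" are the fully parenthesized ternary forms; a sketch (the actual minmax.h definitions may be expressed differently):
```c
/* Classic constant-expression-friendly forms: arguments are fully
 * parenthesized, and each argument may be evaluated more than once. */
#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif
#ifndef MAX
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#endif
```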
2024-07-27  Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull timer migration updates from Thomas Gleixner: "Fixes and minor updates for the timer migration code:
- Stop testing the group->parent pointer as it is not guaranteed to be stable over a chain of operations by design. This includes a warning which would be nice to have but it produces false positives due to the racy nature of the check.
- Plug a race between CPUs going in and out of idle and a CPU hotplug operation. The latter can create and connect a new hierarchy level which is missed in the concurrent updates of CPUs which go into idle. As a result the events of such a CPU might not be processed and timers go stale. Cure it by splitting the hotplug operation into a prepare and online callback. The prepare callback is guaranteed to run on an online and therefore active CPU. This CPU updates the hierarchy and being online ensures that there is always at least one migrator active which handles the modified hierarchy correctly when going idle. The online callback which runs on the incoming CPU then just marks the CPU active and brings it into operation.
- Improve tracing and polish the code further so it is more obvious what's going on"
* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix grammar in comment
  timers/migration: Spare write when nothing changed
  timers/migration: Rename childmask by groupmask to make naming more obvious
  timers/migration: Read childmask and parent pointer in a single place
  timers/migration: Use a single struct for hierarchy walk data
  timers/migration: Improve tracing
  timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  timers/migration: Do not rely always on group->parent
2024-07-25  Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and netfilter. A lot of networking people were at a conference last week, busy catching COVID, so relatively short PR.
Current release - regressions:
- tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
Current release - new code bugs:
- l2tp: protect session IDR and tunnel session list with one lock, make sure the state is coherent to avoid a warning
- eth: bnxt_en: update xdp_rxq_info in queue restart logic
- eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
Previous releases - regressions:
- xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len, the field reuses previously un-validated pad
Previous releases - always broken:
- tap/tun: drop short frames to prevent crashes later in the stack
- eth: ice: add a per-VF limit on number of FDIR filters
- af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"
* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
  tun: add missing verification for short frame
  tap: add missing verification for short frame
  mISDN: Fix a use after free in hfcmulti_tx()
  gve: Fix an edge case for TSO skb validity check
  bnxt_en: update xdp_rxq_info in queue restart logic
  tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
  MAINTAINERS: make Breno the netconsole maintainer
  MAINTAINERS: Update bonding entry
  net: nexthop: Initialize all fields in dumped nexthops
  net: stmmac: Correct byte order of perfect_match
  selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
  tipc: Return non-zero value from tipc_udp_addr2str() on error
  netfilter: nft_set_pipapo_avx2: disable softinterrupts
  ice: Fix recipe read procedure
  ice: Add a per-VF limit on number of FDIR filters
  net: bonding: correctly annotate RCU in bond_should_notify_peers()
  ...
2024-07-25  Merge tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux  (Linus Torvalds)
Pull printk updates from Petr Mladek:
- trivial printk changes
The bigger "real" printk work is still being discussed.
* tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  vsprintf: add missing MODULE_DESCRIPTION() macro
  printk: Rename console_replay_all() and update context
2024-07-25  Merge tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl  (Linus Torvalds)
Pull sysctl constification from Joel Granados: "Treewide constification of the ctl_table argument of proc_handlers using a coccinelle script and some manual code formatting fixups. This is a prerequisite to moving the static ctl_table structs into read-only data section which will ensure that proc_handler function pointers cannot be modified"
* tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: treewide: constify the ctl_table argument of proc_handlers
2024-07-25  Merge tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux  (Linus Torvalds)
Pull kgdb updates from Daniel Thompson: "Three small changes this cycle:
- Clean up an architecture abstraction that is no longer needed because all the architectures have converged.
- Actually use the prompt argument to kdb_position_cursor() instead of ignoring it (functionally this fix is a nop but that was due to luck rather than good judgement).
- Fix a -Wformat-security warning"
* tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Get rid of redundant kdb_curr_task()
  kdb: Use the passed prompt in kdb_position_cursor()
  kdb: address -Wformat-security warnings
2024-07-25  Merge tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping  (Linus Torvalds)
Pull dma-mapping fix from Christoph Hellwig:
- fix the order of actions in dmam_free_coherent (Lance Richardson)
* tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping:
  dma: fix call order in dmam_free_coherent
2024-07-25  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Jakub Kicinski)
Daniel Borkmann says:
====================
pull-request: bpf 2024-07-25
We've added 14 non-merge commits during the last 8 day(s) which contain a total of 19 files changed, 177 insertions(+), 70 deletions(-). The main changes are:
1) Fix af_unix to disable MSG_OOB handling for sockets in BPF sockmap and BPF sockhash. Also add test coverage for this case, from Michal Luczaj.
2) Fix a segmentation issue when downgrading gso_size in the BPF helper bpf_skb_adjust_room(), from Fred Li.
3) Fix a compiler warning in resolve_btfids due to a missing type cast, from Liwei Song.
4) Fix stack allocation for arm64 to align the stack pointer at a 16 byte boundary in the fexit_sleep BPF selftest, from Puranjay Mohan.
5) Fix a xsk regression to require a flag when actuating tx_metadata_len, from Stanislav Fomichev.
6) Fix function prototype BTF dumping in libbpf for prototypes that have no input arguments, from Andrii Nakryiko.
7) Fix stacktrace symbol resolution in perf script for BPF programs containing subprograms, from Hou Tao.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  tools/resolve_btfids: Fix comparison of distinct pointer types warning in resolve_btfids
  bpf, events: Use prog to emit ksymbol event for main program
  selftests/bpf: Test sockmap redirect for AF_UNIX MSG_OOB
  selftests/bpf: Parametrize AF_UNIX redir functions to accept send() flags
  selftests/bpf: Support SOCK_STREAM in unix_inet_redir_to_connected()
  af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhash
  bpftool: Fix typo in usage help
  libbpf: Fix no-args func prototype BTF dumping syntax
  MAINTAINERS: Update powerpc BPF JIT maintainers
  MAINTAINERS: Update email address of Naveen
  selftests/bpf: fexit_sleep: Fix stack allocation for arm64
====================
Link: https://patch.msgid.link/20240725114312.32197-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-24  sysctl: treewide: constify the ctl_table argument of proc_handlers  (Joel Granados)
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata, which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script:
```
virtual patch

@r1@
identifier ctl, write, buffer, lenp, ppos;
identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);

@r2@
identifier func, ctl, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int write, void *buffer, size_t *lenp, loff_t *ppos)
{ ... }

@r3@
identifier func;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int , void *, size_t *, loff_t *);

@r4@
identifier func, ctl;
@@
int func(
- struct ctl_table *ctl
+ const struct ctl_table *ctl
  ,int , void *, size_t *, loff_t *);

@r5@
identifier func, write, buffer, lenp, ppos;
@@
int func(
- struct ctl_table *
+ const struct ctl_table *
  ,int write, void *buffer, size_t *lenp, loff_t *ppos);
```
* Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax were adjusted.
* The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration.
Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
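The net effect on a handler signature, shown as a before/after sketch with proc_dointvec (one of the many real handlers covered by the script):
```c
/* Before the treewide change: */
int proc_dointvec(struct ctl_table *table, int write,
		  void *buffer, size_t *lenp, loff_t *ppos);

/* After: the table argument is const, so handlers cannot modify the
 * (eventually read-only) ctl_table entry they are passed. */
int proc_dointvec(const struct ctl_table *table, int write,
		  void *buffer, size_t *lenp, loff_t *ppos);
```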
2024-07-23  Merge tag 'kbuild-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild  (Linus Torvalds)
Pull Kbuild updates from Masahiro Yamada:
- Remove tristate choice support from Kconfig
- Stop using the PROVIDE() directive in the linker script
- Reduce the number of links for the combination of CONFIG_KALLSYMS and CONFIG_DEBUG_INFO_BTF
- Enable the warning for symbol reference to .exit.* sections by default
- Fix warnings in RPM package builds
- Improve scripts/make_fit.py to generate a FIT image with separate base DTB and overlays
- Improve choice value calculation in Kconfig
- Fix conditional prompt behavior in choice in Kconfig
- Remove support for the uncommon EMAIL environment variable in Debian package builds
- Remove support for the uncommon "name <email>" form for the DEBEMAIL environment variable
- Raise the minimum supported GNU Make version to 4.0
- Remove stale code for the absolute kallsyms
- Move header files commonly used for host programs to scripts/include/
- Introduce the pacman-pkg target to generate a pacman package used in Arch Linux
- Clean up Kconfig
* tag 'kbuild-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (65 commits)
  kbuild: doc: gcc to CC change
  kallsyms: change sym_entry::percpu_absolute to bool type
  kallsyms: unify seq and start_pos fields of struct sym_entry
  kallsyms: add more original symbol type/name in comment lines
  kallsyms: use \t instead of a tab in printf()
  kallsyms: avoid repeated calculation of array size for markers
  kbuild: add script and target to generate pacman package
  modpost: use generic macros for hash table implementation
  kbuild: move some helper headers from scripts/kconfig/ to scripts/include/
  Makefile: add comment to discourage tools/* addition for kernel builds
  kbuild: clean up scripts/remove-stale-files
  kconfig: recursive checks drop file/lineno
  kbuild: rpm-pkg: introduce a simple changelog section for kernel.spec
  kallsyms: get rid of code for absolute kallsyms
  kbuild: Create INSTALL_PATH directory if it does not exist
  kbuild: Abort make on install failures
  kconfig: remove 'e1' and 'e2' macros from expression deduplication
  kconfig: remove SYMBOL_CHOICEVAL flag
  kconfig: add const qualifiers to several function arguments
  kconfig: call expr_eliminate_yn() at least once in expr_eliminate_dups()
  ...
2024-07-23  Merge tag 'livepatching-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching  (Linus Torvalds)
Pull livepatching update from Petr Mladek:
- show patch->replace flag in sysfs
- add or improve few selftests
* tag 'livepatching-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/livepatching/livepatching:
  livepatch: Replace snprintf() with sysfs_emit()
  selftests/livepatch: Add selftests for "replace" sysfs attribute
  livepatch: Add "replace" sysfs attribute
  selftests: livepatch: Test atomic replace against multiple modules
  selftests/livepatch: define max test-syscall processes
2024-07-22  Merge tag 'irq-msi-2024-07-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull MSI interrupt updates from Thomas Gleixner: "Switch ARM/ARM64 over to the modern per device MSI domains. This simplifies the handling of platform MSI and wire to MSI controllers and removes about 500 lines of legacy code. Aside of that it paves the way for ARM/ARM64 to utilize the dynamic allocation of PCI/MSI interrupts and to support the upcoming non standard IMS (Interrupt Message Store) mechanism on PCIe devices"
* tag 'irq-msi-2024-07-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  irqchip/gic-v3-its: Correctly fish out the DID for platform MSI
  irqchip/gic-v3-its: Correctly honor the RID remapping
  genirq/msi: Move msi_device_data to core
  genirq/msi: Remove platform MSI leftovers
  irqchip/irq-mvebu-icu: Remove platform MSI leftovers
  irqchip/irq-mvebu-sei: Switch to MSI parent
  irqchip/mvebu-odmi: Switch to parent MSI
  irqchip/mvebu-gicp: Switch to MSI parent
  irqchip/irq-mvebu-icu: Prepare for real per device MSI
  irqchip/imx-mu-msi: Switch to MSI parent
  irqchip/gic-v2m: Switch to device MSI
  irqchip/gic_v3_mbi: Switch over to parent domain
  genirq/msi: Remove platform_msi_create_device_domain()
  irqchip/mbigen: Remove platform_msi_create_device_domain() fallback
  irqchip/gic-v3-its: Switch platform MSI to MSI parent
  irqchip/irq-msi-lib: Prepare for DOMAIN_BUS_WIRED_TO_MSI
  irqchip/mbigen: Prepare for real per device MSI
  irqchip/irq-msi-lib: Prepare for DEVICE MSI to replace platform MSI
  irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]
  irqchip/irq-msi-lib: Prepare for PCI MSI/MSIX
  ...
2024-07-22  Merge tag 'irq-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull interrupt subsystem updates from Thomas Gleixner:
"Core:
- Provide a new mechanism to create interrupt domains. The existing interfaces have already too many parameters and it's a pain to expand any of this for new required functionality. The new function takes a pointer to a data structure as argument. The data structure combines all existing parameters and allows for easy extension. The first extension for this is to handle the instantiation of generic interrupt chips at the core level and to allow drivers to provide extra init/exit callbacks. This is necessary to do the full interrupt chip initialization before the new domain is published, so that concurrent usage sites won't see a half initialized interrupt domain. Similar problems exist on teardown. This has turned out to be a real problem due to the deferred and parallel probing which was added in recent years. Handling this at the core level allows to remove quite some accrued boilerplate code in existing drivers and avoids horrible workarounds at the driver level.
- The usual small improvements all over the place
Drivers:
- Add support for LAN966x OIC and RZ/Five SoC
- Split the STM ExtI driver into a microcontroller and a SMP version to allow building the latter as a module for multi-platform kernels
- Enable MSI support for Armada 370XP on platforms which do not support IPIs
- The usual small fixes and enhancements all over the place"
* tag 'irq-core-2024-07-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (59 commits)
  irqdomain: Fix the kernel-doc and plug it into Documentation
  genirq: Set IRQF_COND_ONESHOT in request_irq()
  irqchip/imx-irqsteer: Handle runtime power management correctly
  irqchip/gic-v3: Pass #redistributor-regions to gic_of_setup_kvm_info()
  irqchip/bcm2835: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/gic-v4: Make sure a VPE is locked when VMAPP is issued
  irqchip/gic-v4: Substitute vmovp_lock for a per-VM lock
  irqchip/gic-v4: Always configure affinity on VPE activation
  Revert "irqchip/dw-apb-ictl: Support building as module"
  Revert "Loongarch: Support loongarch avec"
  arm64: Kconfig: Allow build irq-stm32mp-exti driver as module
  ARM: stm32: Allow build irq-stm32mp-exti driver as module
  irqchip/stm32mp-exti: Allow building as module
  irqchip/stm32mp-exti: Rename internal symbols
  irqchip/stm32-exti: Split MCU and MPU code
  arm64: Kconfig: Select STM32MP_EXTI on STM32 platforms
  ARM: stm32: Use different EXTI driver on ARMv7m and ARMv7a
  irqchip/stm32-exti: Add CONFIG_STM32MP_EXTI
  irqchip/dw-apb-ictl: Support building as module
  irqchip/riscv-aplic: Simplify the initialization code
  ...
2024-07-22  timers/migration: Fix grammar in comment  (Anna-Maria Behnsen)
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-8-757baa7803fe@linutronix.de
2024-07-22  timers/migration: Spare write when nothing changed  (Anna-Maria Behnsen)
The wakeup value is written unconditionally in tmigr_cpu_new_timer(). When there is no new next timer expiry that needs to be propagated, the value that was read before is simply written back, which is not required. Move the write to the place where the wakeup value actually changes. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-7-757baa7803fe@linutronix.de
2024-07-22  timers/migration: Rename childmask by groupmask to make naming more obvious  (Anna-Maria Behnsen)
childmask in the group reflects the mask that is required to 'reference' this group in the parent. When reading childmask, this might be confusing, as it suggests that this is the mask of the group's own child. Clarify this by renaming childmask to groupmask in tmigr_group and tmc_group. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-6-757baa7803fe@linutronix.de
2024-07-22  timers/migration: Read childmask and parent pointer in a single place  (Anna-Maria Behnsen)
Reading the childmask and parent pointer is required when propagating changes through the hierarchy. At the moment these reads are spread all over the place, which makes the code harder to follow. Move those reads to a single place to keep the code clean. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-5-757baa7803fe@linutronix.de
2024-07-22  timers/migration: Use a single struct for hierarchy walk data  (Anna-Maria Behnsen)
Two different structs are defined for propagating data from one to another level when walking the hierarchy. Several struct members exist in both structs which makes generalization harder. Merge those two structs into a single one and use it directly in walk_groups() and the corresponding function pointers instead of introducing pointer casting all over the place. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-4-757baa7803fe@linutronix.de
2024-07-22  timers/migration: Improve tracing  (Anna-Maria Behnsen)
The trace points for inactive and active propagation are located at the end of the related functions. The interesting information of those trace points is the updated group state. When trace points are not located directly at the place where the group state changes, the order of trace points in traces can be confusing. Move the inactive and active propagation trace points directly after the update of the group state values. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240716-tmigr-fixes-v4-3-757baa7803fe@linutronix.de
2024-07-22  timers/migration: Move hierarchy setup into cpuhotplug prepare callback  (Anna-Maria Behnsen)
When a CPU comes online the first time, it is possible that a new top level group will be created. In general all propagation is done from the bottom to top. This minimizes complexity and prevents possible races. But when a new top level group is created, the formerly top level group needs to be connected to the new level. This is the only time when the direction in which changes are propagated is reversed: the changes are propagated from top (new top level group) to bottom (formerly top level group). This introduces two races (see (A) and (B)) as reported by Frederic. (The original message contains ASCII diagrams of the hierarchy at each step; they are condensed here into bracketed state summaries.)
(A) This race happens when the formerly top level group is marked active, but the last active CPU of that group goes idle. Then it's likely that the formerly top level group is no longer active, but is nevertheless marked active in the new top level group:
0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU. [GRP0:0 {migrator=0, active=0, nextevt=KTIME_MAX}; CPU 0 active, CPUs 1..7 idle]
1) CPU 8 is booting and creates a new group on the first level, GRP0:1, and therefore also a new top group, GRP1:0. For now the setup code has proceeded only up to connecting GRP0:1 to the new top group. The connection between CPU 8 and GRP0:1 is not yet established, and CPU 8 is still !online. [GRP1:0 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}, child: GRP0:1 only; GRP0:0 {migrator=0, active=0, nextevt=KTIME_MAX}; GRP0:1 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; CPU 0 active, CPUs 1..7 idle, CPU 8 !online]
2) The setup code now connects GRP0:0 to GRP1:0 and observes, while in tmigr_connect_child_parent(), that GRP0:0's migrator is not TMIGR_NONE. So it prepares to call tmigr_active_up() on it. It hasn't done so yet. [GRP1:0 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}, children: GRP0:0 and GRP0:1; GRP0:0 {migrator=0, active=0, nextevt=KTIME_MAX}; GRP0:1 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; CPU 0 active, CPUs 1..7 idle, CPU 8 !online]
3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with GRP0:0->lock held, CPU 0 observes GRP1:0 after calling tmigr_update_events() and propagates the change to the top (no change there and no wakeup programmed since there is no timer). [GRP1:0 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; GRP0:0 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; GRP0:1 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; CPUs 0..7 idle, CPU 8 !online]
4) Now the setup code finally calls tmigr_active_up() and sets GRP0:0 active in GRP1:0. [GRP1:0 {migrator=GRP0:0, active=GRP0:0, nextevt=KTIME_MAX}; GRP0:0 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; GRP0:1 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; CPUs 0..7 idle, CPU 8 !online]
5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up() out of tmigr_cpu_online(). [GRP1:0 {migrator=GRP0:0, active=GRP0:0 and GRP0:1, nextevt=KTIME_MAX}; GRP0:0 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; GRP0:1 {migrator=8, active=8, nextevt=KTIME_MAX}; CPUs 0..7 idle, CPU 8 active]
6) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator. But GRP0:0 is not really active, so T8 gets ignored. [GRP1:0 {migrator=GRP0:0, active=GRP0:0, nextevt=T8}; GRP0:0 {migrator=TMIGR_NONE, active=NONE, nextevt=KTIME_MAX}; GRP0:1 {migrator=TMIGR_NONE, active=NONE, nextevt=T8}; CPUs 0..8 idle]
--> The update done in the third step is not noticed by the setup code. So a wrong migrator is set in the top level group and a timer can get ignored.
(B) Reading group->parent and group->childmask while a hierarchy update is ongoing and has reached the formerly top level group is racy, as those values could be inconsistent. (The notation of migrator and active changes slightly compared to the above example, as the childmasks are now used.)
1) Hierarchy has 8 CPUs. CPU 8 is in the process of onlining but has not yet connected GRP0:0 to GRP1:0. [GRP1:0 {migrator=TMIGR_NONE, active=0x00, nextevt=KTIME_MAX}, child: GRP0:1 only; GRP0:0 {migrator=TMIGR_NONE, active=0x00, nextevt=KTIME_MAX, childmask=0, parent=NULL}; GRP0:1 {migrator=TMIGR_NONE, active=0x00, nextevt=KTIME_MAX, childmask=1, parent=GRP1:0}; CPUs 0..7 idle, CPU 8 !online with childmask=1]
2) The setup code (running on CPU 8) now connects GRP0:0 to GRP1:0, updates the parent pointer of GRP0:0, and ... [as in step 1, but GRP0:0 now has parent=GRP1:0 and both children are connected]
3) ... CPU 0 becomes active at the same time. As the migrator in GRP0:0 was TMIGR_NONE, the childmask of GRP0:0 is stored in the update propagation data structure tmigr_walk (the update of childmask is not yet visible/updated). [GRP0:0 {migrator=0x01, active=0x01, childmask=0, parent=GRP1:0}; CPU 0 active; tmigr_walk.childmask = 0] And now ...
4) ... the childmask of GRP0:0 is updated by CPU 8 (still part of the setup code). [GRP0:0 {migrator=0x01, active=0x01, childmask=2, parent=GRP1:0}; tmigr_walk.childmask = 0]
5) CPU 0 sees the connection to GRP1:0 and now propagates the active state to GRP1:0, but with childmask = 0 as stored in the propagation data structure. [GRP1:0 {migrator=0x00, active=0x00, nextevt=KTIME_MAX}] --> Now GRP1:0 always has a migrator, as 0x00 != TMIGR_NONE, and to all CPUs it looks like GRP1:0 is always active.
To prevent those races, the setup of the hierarchy is moved into the cpuhotplug prepare callback. The prepare callback is not executed by the CPU which will come online; it is executed by the CPU which prepares the onlining of the other CPU. This CPU is active while it is connecting the formerly top level to the new one. This prevents (A) from happening, and it also prevents any further walk above the formerly top level until that active CPU becomes inactive, releasing the new ->parent and ->childmask updates to be visible by any subsequent walk up above the formerly top level hierarchy. This prevents (B) from happening. The direction of the updates is now forced to look like "from bottom to top".
However, if the active CPU prevents tmigr_cpu_(in)active() from walking up with the update not or only half visible, nothing prevents walking up to the new top with a 0 childmask in tmigr_handle_remote_up() or tmigr_requires_handle_remote_up() if the active CPU doing the prepare is not the migrator. But then it looks fine because:
* tmigr_check_migrator() should just return false.
* The migrator is active and should eventually observe the new childmask at some point in a future tick.
Split the setup functionality of the online callback into the cpuhotplug prepare callback and set up the hotplug state accordingly. Change the init call into an early_initcall() to make sure an already active CPU prepares everything for newly upcoming CPUs. Reorder the code so that all prepare-related functions are close to each other, and the online and offline callbacks are also close together. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/r/20240717094940.18687-1-anna-maria@linutronix.de