path: root/kernel
2024-07-29  genirq/msi: Silence 'set affinity failed' warning  (Marek Vasut)
Various PCI controllers that mux MSIs onto a single IRQ line produce these "IRQ%d: set affinity failed" warnings when entering suspend. This has been discussed before [1] [2] and an example test case is included at the end of this commit message.

Controller drivers that create an MSI IRQ domain with MSI_FLAG_USE_DEF_CHIP_OPS and do not override the .irq_set_affinity() irqchip callback get assigned the default msi_domain_set_affinity() callback. That is not desired on controllers where it is not possible to set the affinity of each MSI IRQ line to a specific CPU core due to hardware limitations.

Introduce the flag MSI_FLAG_NO_AFFINITY, which keeps .irq_set_affinity() unset if the controller driver did not assign it. This way, migrate_one_irq() can exit right away, without printing the warning. The .irq_set_affinity() implementations which only return -EINVAL can be removed from multiple controller drivers.

  $ grep 25 /proc/interrupts
  25:  0  0  0  0  0  0  0  0   PCIe MSI 0 Edge   PCIe PME

  $ echo core > /sys/power/pm_test ; echo mem > /sys/power/state
  ...
  Disabling non-boot CPUs ...
  IRQ25: set affinity failed(-22).  <---------- This is being silenced here
  psci: CPU7 killed (polled 4 ms)
  ...

[1] https://lore.kernel.org/all/d4a6eea3c5e33a3a4056885419df95a7@kernel.org/
[2] https://lore.kernel.org/all/5f4947b18bf381615a37aa81c2242477@kernel.org/

Link: https://lore.kernel.org/r/20240723132958.41320-2-marek.vasut+renesas@mailbox.org
Signed-off-by: Marek Vasut <marek.vasut+renesas@mailbox.org>
[bhelgaas: commit log]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
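A controller driver opting out of per-vector affinity would then set the new flag in its MSI domain info. A minimal sketch (hypothetical "foo" driver names; the companion flags are illustrative, MSI_FLAG_NO_AFFINITY is the only new piece):

  static struct msi_domain_info foo_msi_domain_info = {
          .flags = MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
                   MSI_FLAG_NO_AFFINITY,   /* keep .irq_set_affinity unset */
          .chip  = &foo_msi_irq_chip,
  };

With the flag set, the default chip-ops fixup no longer installs msi_domain_set_affinity(), so migrate_one_irq() finds no callback and exits silently.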
2024-07-29  profiling: remove prof_cpu_mask  (Tetsuo Handa)
syzbot is reporting uninit-value at profile_hits(), for there is a race window between

  if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
          return -ENOMEM;
  cpumask_copy(prof_cpu_mask, cpu_possible_mask);

in profile_init() and

  cpumask_available(prof_cpu_mask) &&
  cpumask_test_cpu(smp_processor_id(), prof_cpu_mask)

in profile_tick(); prof_cpu_mask remains uninitialized until cpumask_copy() completes, while cpumask_available(prof_cpu_mask) returns true as soon as alloc_cpumask_var(&prof_cpu_mask) completes.

We could replace alloc_cpumask_var() with zalloc_cpumask_var() and call cpumask_copy() from create_proc_profile() on only UP kernels, for profile_online_cpu() calls cpumask_set_cpu() as needed via cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch removes prof_cpu_mask because it seems unnecessary.

The cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) test in profile_tick() is likely always true, because a CPU cannot call profile_tick() while that CPU is offline, cpumask_set_cpu(cpu, prof_cpu_mask) is called when that CPU becomes online, and cpumask_clear_cpu(cpu, prof_cpu_mask) is called when that CPU becomes offline. This test could be false during the transition between online and offline. But according to include/linux/cpuhotplug.h, CPUHP_PROFILE_PREPARE belongs to the PREPARE section, which means that the CPU subjected to profile_dead_cpu() cannot be inside profile_tick() (i.e. no risk of a use-after-free bug), because interrupts for that CPU are disabled during the PREPARE section. Therefore, this test is guaranteed to be true, and can be removed. (Since profile_hits() checks prof_buffer != NULL, we don't need to check prof_buffer != NULL here unless get_irq_regs() or user_mode() is so slow that we want to avoid them when prof_buffer == NULL.)

do_profile_hits() is called from profile_tick() from the timer interrupt only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and prof_buffer is not NULL. But syzbot is also reporting that sometimes do_profile_hits() is called while the current thread is still doing vzalloc(), where prof_buffer must be NULL at that moment. This indicates that multiple threads concurrently tried to write to the /sys/kernel/profiling interface, causing one writer to re-allocate prof_buffer while another had already allocated it. Fix this by serializing the writers.

Reported-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=b1a83ab2a9eb9321fbdd
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
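The serialization boils down to making the /sys/kernel/profiling write path single-threaded. A minimal sketch, assuming a kobj_attribute store handler and an illustrative lock name; the sketch compresses the handler's parsing and setup into the existing profile_init() call, and the actual fix may scope the lock differently:

  static DEFINE_MUTEX(profiling_mutex);   /* illustrative name */

  static ssize_t profiling_store(struct kobject *kobj,
                                 struct kobj_attribute *attr,
                                 const char *buf, size_t count)
  {
          int ret = 0;

          mutex_lock(&profiling_mutex);
          /*
           * Only one writer may observe prof_buffer == NULL and
           * (re)allocate it; concurrent writers now wait here.
           */
          if (!prof_on)
                  ret = profile_init();   /* allocates prof_buffer */
          mutex_unlock(&profiling_mutex);

          return ret ? ret : count;
  }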
2024-07-29  sched/fair: Cleanup fair_server  (Peter Zijlstra)
The throttle interaction made my brain hurt; make it consistently about 0 transitions of h_nr_running.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2024-07-29  sched/rt: Remove default bandwidth control  (Peter Zijlstra)
Now that fair_server exists, we no longer need RT bandwidth control unless RT_GROUP_SCHED is enabled. Enable fair_server with parameters equivalent to RT throttling.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/14d562db55df5c3c780d91940743acb166895ef7.1716811044.git.bristot@kernel.org
2024-07-29  sched/core: Fix picking of tasks for core scheduling with DL server  (Joel Fernandes (Google))
* Use simple CFS pick_task for DL pick_task

  The DL server's pick_task calls CFS's pick_next_task_fair(). This is wrong because core scheduling's pick_task only calls CFS's pick_task() for evaluation / checking of the CFS task (comparing across CPUs), not for actually affirmatively picking the next task. This causes RB-tree corruption issues in CFS that were found by syzbot.

* Make pick_task_fair clear DL server

  A DL task pick might set ->dl_server, but it is possible the task will never run (say the other HT has a stop task). If the CFS task is later picked directly (say without the DL server), ->dl_server will still be set. So clear it in pick_task_fair(), as sketched below. This fixes the KASAN issue reported by syzbot in set_next_entity().

(DL refactoring suggestions by Vineeth Pillai.)

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b10489ab1f03d23e08e6097acea47442e7d6466f.1716811044.git.bristot@kernel.org
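The second fix boils down to resetting the stale pointer whenever CFS hands a task out directly. A sketch of the idea; the exact placement inside pick_task_fair() is approximate and the surrounding pick loop is elided:

  /* in pick_task_fair(), once the entity walk has settled on a task */
  p = task_of(se);
  /*
   * A previous DL-server pick may have tagged p but never run it;
   * don't let the stale ->dl_server reach set_next_entity() later.
   */
  p->dl_server = NULL;

  return p;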
2024-07-29  sched/core: Fix priority checking for DL server picks  (Joel Fernandes (Google))
In core scheduling, a DL server pick (which is a CFS task) should be given higher priority than tasks in other classes. Not doing so causes CFS starvation; a kselftest is added later to demonstrate this. Without the fix, a CFS task competing with RT tasks can be completely starved and the DL server's boosting completely ignored. Fix these problems.

Reported-by: Suleiman Souhlal <suleiman@google.com>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vineeth Pillai <vineeth@bitbyteword.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/48b78521d86f3b33c24994d843c1aad6b987dda9.1716811044.git.bristot@kernel.org
2024-07-29  sched/fair: Fair server interface  (Daniel Bristot de Oliveira)
Add an interface for fair server setup on debugfs. Each CPU has two files under /debug/sched/fair_server/cpu{ID}:

 - runtime: set runtime in ns
 - period:  set period in ns

This then leaves /proc/sys/kernel/sched_rt_{period,runtime}_us to set bounds on admission control. The interface also adds the server to the dl bandwidth accounting.

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/a9ef9fc69bcedb44bddc9bc34f2b313296052819.1716811044.git.bristot@kernel.org
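For example, granting the fair server 25 ms of runtime every 100 ms on CPU 0 would look like this (values illustrative; file paths as listed above):

  $ echo 25000000 > /debug/sched/fair_server/cpu0/runtime
  $ echo 100000000 > /debug/sched/fair_server/cpu0/period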
2024-07-29  sched/deadline: Deferrable dl server  (Daniel Bristot de Oliveira)
Among the motivations for the DL servers is the real-time throttling mechanism. This mechanism works by throttling the rt_rq after it runs for a long period without leaving space for fair tasks.

The base dl server avoids this problem by boosting fair tasks instead of throttling the rt_rq. The point is that it boosts without waiting for potential starvation, causing some non-intuitive cases. For example, an IRQ dispatches two tasks on an idle system, a fair one and an RT one. The DL server will be activated, running the fair task before the RT one.

This problem can be avoided by deferring the dl server activation. By setting the defer option, the dl_server will dispatch a SCHED_DEADLINE reservation with replenished runtime, but throttled. The dl_timer will be set for the defer time, at (period - runtime) ns from the start time, thus boosting the fair rq at defer time.

If the fair scheduler has the opportunity to run while waiting for the defer time, the dl server runtime will be consumed. If the runtime is completely consumed before the defer time, the server will be replenished while still in a throttled state. Then, the dl_timer will be reset to the new defer time.

If the fair server reaches the defer time without consuming its runtime, the server will start running, following CBS rules (thus without breaking SCHED_DEADLINE). The server then continues running (without deferring) until its fair tasks are able to execute as a regular fair scheduler again (end of the starvation).

Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/dd175943c72533cd9f0b87767c6499204879cc38.1716811044.git.bristot@kernel.org
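As a worked example: with runtime = 25 ms and period = 100 ms, the defer timer is armed at 100 - 25 = 75 ms after activation. Fair tasks that get to run before that point consume the 25 ms budget (and the server may be replenished while still throttled); only if starvation persists at the 75 ms mark does the server actually begin running under CBS rules.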
2024-07-29  sched/fair: Add trivial fair server  (Peter Zijlstra)
Use deadline servers to service fair tasks.

This patch adds a fair_server deadline entity which acts as a container for fair entities and can be used to fix starvation when higher priority (wrt fair) tasks are monopolizing CPU(s).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/b6b0bcefaf25391bcf5b6ecdb9f1218de402d42e.1716811044.git.bristot@kernel.org
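Shape-wise, the server is a per-runqueue SCHED_DEADLINE entity driven by CFS callbacks. A rough sketch (field and callback names approximate, not the literal patch):

  struct rq {
          ...
          struct sched_dl_entity  fair_server;    /* DL entity serving CFS */
  };

  /* at rq init time: wire the server to CFS-provided callbacks */
  dl_server_init(&rq->fair_server, rq,
                 fair_server_has_tasks,   /* "does CFS have runnable tasks?" */
                 fair_server_pick);       /* "pick one through CFS"          */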
2024-07-29  sched/core: Clear prev->dl_server in CFS pick fast path  (Youssef Esmat)
In case the previous pick was a DL server pick, ->dl_server might be set. Clear it in the fast path as well.

Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Youssef Esmat <youssefesmat@google.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/7f7381ccba09efcb4a1c1ff808ed58385eccc222.1716811044.git.bristot@kernel.org
2024-07-29  sched/core: Add clearing of ->dl_server in put_prev_task_balance()  (Joel Fernandes (Google))
Paths using put_prev_task_balance() need to do a pick shortly after. Make sure they also clear the ->dl_server on prev as a part of that.

Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/d184d554434bedbad0581cb34656582d78655150.1716811044.git.bristot@kernel.org
2024-07-29  sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy  (Tianchen Ding)
Consider the following cgroup hierarchy:

                  root
                   |
       ------------------------
       |                      |
  normal_cgroup          idle_cgroup
       |                      |
  SCHED_IDLE task_A      SCHED_NORMAL task_B

According to the cgroup hierarchy, A should preempt B. But the current check_preempt_wakeup_fair() treats the cgroup se and the task separately, so B will preempt A unexpectedly.

Unify the wakeup logic by {c,p}se_is_idle only (see the sketch below). This makes SCHED_IDLE of a task a relative policy that is effective only within its own cgroup, similar to the behavior of NICE.

Also fix the se_is_idle() definition when !CONFIG_FAIR_GROUP_SCHED.

Fixes: 304000390f88 ("sched: Cgroup SCHED_IDLE support")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20240626023505.1332596-1-dtcccc@linux.alibaba.com
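The unified check compares idle-ness only after both entities have been walked to their common hierarchy level. A sketch of the logic, simplified from what check_preempt_wakeup_fair() would do:

  find_matching_se(&se, &pse);        /* walk to the common cgroup depth */

  cse_is_idle = se_is_idle(se);
  pse_is_idle = se_is_idle(pse);

  /* never let an idle entity preempt a non-idle one at this level... */
  if (cse_is_idle && !pse_is_idle)
          return;
  /* ...and let a non-idle waker preempt an idle entity (sketch) */
  if (!cse_is_idle && pse_is_idle)
          goto preempt;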
2024-07-29  sched: remove HZ_BW feature hedge  (Phil Auld)
As a hedge against unexpected user issues, commit 88c56cfeaec4 ("sched/fair: Block nohz tick_stop when cfs bandwidth in use") included a scheduler feature to disable the new functionality. It's been a few releases (v6.6) and no screams, so remove it.

Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20240515133705.3632915-1-pauld@redhat.com
2024-07-29  sched/fair: Remove cfs_rq::nr_spread_over and cfs_rq::exec_clock  (Chuyi Zhou)
nr_spread_over tracks the number of instances where the difference between a scheduling entity's virtual runtime and the minimum virtual runtime in the runqueue exceeds three times the scheduler latency, indicating significant disparity in task scheduling. Its last user was removed by commit 5e963f2bd ("sched/fair: Commit to EEVDF").

cfs_rq->exec_clock was used to account for time spent executing tasks. Its last user was removed by commit 5d69eca542ee1 ("sched: Unify runtime accounting across classes").

cfs_rq::nr_spread_over and cfs_rq::exec_clock are not used anymore under EEVDF. Remove them from struct cfs_rq.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Acked-by: Vishal Chourasia <vishalc@linux.ibm.com>
Link: https://lore.kernel.org/r/20240717143342.593262-1-zhouchuyi@bytedance.com
2024-07-29  sched/core: Add WARN_ON_ONCE() to check overflow for migrate_disable()  (Peilin He)
Background
==========
When repeated migrate_disable() calls are made with the corresponding migrate_enable() calls missing, there is a risk of 'migration_disabled' overflowing, because 'migration_disabled' is an unsigned short whose max value is 65535.

In a PREEMPT_RT kernel, if 'migration_disabled' overflows, it may render migrate_disable() ineffective within local_lock_irqsave(). This is because, during the scheduling procedure, the value of 'migration_disabled' will be checked, which can trigger CPU migration. Consequently, the count of 'rcu_read_lock_nesting' may leak due to local_lock_irqsave() and local_unlock_irqrestore() occurring on different CPUs.

Usecase
=======
For example, when I developed a driver, I encountered a "WARNING: CPU: 4 PID: 260 at kernel/rcu/tree_plugin.h:315 rcu_note_context_switch+0xa8/0x4e8" warning. It took me half a month to locate this issue. Ultimately, I discovered that the lack of an overflow detection mechanism in migrate_disable() was the root cause, leading to a significant amount of time spent on problem localization. With overflow detection added to migrate_disable(), the root cause could have been identified very quickly and easily.

Effect
======
Using WARN_ON_ONCE() to check whether 'migration_disabled' overflows can help developers identify the issue quickly.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peilin He <he.peilin@zte.com.cn>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yunkai Zhang <zhang.yunkai@zte.com.cn>
Reviewed-by: Qiang Tu <tu.qiang35@zte.com.cn>
Reviewed-by: Kun Jiang <jiang.kun2@zte.com.cn>
Reviewed-by: Fan Yu <fan.yu9@zte.com.cn>
Link: https://lkml.kernel.org/r/20240716104244764N2jD8gnBpnsLjCDnQGQ8c@zte.com.cn
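The added check might look like this (sketch; the actual patch may warn at a different threshold, e.g. halfway through the u16 range, or gate the check behind a debug config):

  void migrate_disable(void)
  {
          struct task_struct *p = current;

          if (p->migration_disabled) {
                  /* flag a u16 counter about to wrap past 65535 */
                  WARN_ON_ONCE(p->migration_disabled == USHRT_MAX);
                  p->migration_disabled++;
                  return;
          }

          /* first-time disable path unchanged (elided) */
  }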
2024-07-29  sched: Initialize the vruntime of a new task when it is first enqueued  (Zhang Qiao)
When creating a new task, we initialize the vruntime of the new task at sched_cgroup_fork(). However, this runs too early and may not be accurate, because it uses the current CPU to init the vruntime, while the new task actually runs on the CPU assigned at wake_up_new_task().

To optimize this case, we pass the ENQUEUE_INITIAL flag to activate_task() in wake_up_new_task(); this way, when place_entity() is called in enqueue_entity(), the vruntime of the new task will be initialized there.

In addition, place_entity() in task_fork_fair() was introduced for two reasons:

1. Previously, __enqueue_entity() was in task_new_fair(); in order to provide a vruntime for enqueueing the new task, the assignment "se->vruntime = cfs_rq->min_vruntime" was introduced by commit e9acbff6484d ("sched: introduce se->vruntime"). This is the initial state of place_entity().

2. Commit 4d78e7b656aa ("sched: new task placement for vruntime") added the child_runs_first task placement feature, which is based on vruntime and therefore also requires the new task's vruntime value.

After the removal of child_runs_first and of enqueue_entity() from task_fork_fair(), this place_entity() no longer makes sense, so remove it as well.

Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240627133359.1370598-1-zhangqiao22@huawei.com
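The change amounts to ORing the flag in at wakeup time; a sketch (assuming the existing ENQUEUE_NOCLOCK usage at that call site is kept):

  /* in wake_up_new_task(), after the task's CPU has been chosen */
  activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_INITIAL);

place_entity() then runs in enqueue_entity() on the runqueue the task will actually use, so the initial vruntime is taken from the right cfs_rq.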
2024-07-29  sched/core: Fix unbalanced set_rq_online/offline() in sched_cpu_deactivate()  (Yang Yingliang)
If cpuset_cpu_inactive() fails, set_rq_online() needs to be called to roll back.

Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
2024-07-29  sched/core: Introduce sched_set_rq_on/offline() helper  (Yang Yingliang)
Introduce the sched_set_rq_on/offline() helpers, so they can be called simply from the normal or error path. No functional change.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
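The online half might look like this (sketch mirroring the existing open-coded sequence in sched_cpu_activate(); the offline variant calls set_rq_offline() instead):

  static inline void sched_set_rq_online(struct rq *rq, int cpu)
  {
          struct rq_flags rf;

          rq_lock_irqsave(rq, &rf);
          if (rq->rd) {
                  BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
                  set_rq_online(rq);
          }
          rq_unlock_irqrestore(rq, &rf);
  }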
2024-07-29  sched/smt: Fix unbalanced sched_smt_present dec/inc  (Yang Yingliang)
I got the following warning report while doing stress tests:

  jump label: negative count!
  WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
  Call Trace:
  <TASK>
  __static_key_slow_dec_cpuslocked+0x16/0x70
  sched_cpu_deactivate+0x26e/0x2a0
  cpuhp_invoke_callback+0x3ad/0x10d0
  cpuhp_thread_fun+0x3f5/0x680
  smpboot_thread_fn+0x56d/0x8d0
  kthread+0x309/0x400
  ret_from_fork+0x41/0x70
  ret_from_fork_asm+0x1b/0x30
  </TASK>

When cpuset_cpu_inactive() fails in sched_cpu_deactivate(), the CPU offline fails, but sched_smt_present has already been decremented earlier in sched_cpu_deactivate(). This leads to an unbalanced dec/inc; fix it by incrementing sched_smt_present in the error path.

Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
2024-07-29  sched/smt: Introduce sched_smt_present_inc/dec() helper  (Yang Yingliang)
Introduce the sched_smt_present_inc/dec() helpers, so they can be called simply from the normal or error path. No functional change.

Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
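A sketch of the pair (the condition mirrors the usual "this CPU's SMT sibling count just crossed one" check; exact form per the patch):

  static inline void sched_smt_present_inc(int cpu)
  {
  #ifdef CONFIG_SCHED_SMT
          if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
                  static_branch_inc_cpuslocked(&sched_smt_present);
  #endif
  }

  static inline void sched_smt_present_dec(int cpu)
  {
  #ifdef CONFIG_SCHED_SMT
          if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
                  static_branch_dec_cpuslocked(&sched_smt_present);
  #endif
  }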
2024-07-29  sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime  (Zheng Zucheng)
In extreme test scenarios, the 14th field (utime) in /proc/xx/stat was greater than sum_exec_runtime:

  utime = 18446744073709518790 ns, rtime = 135989749728000 ns

In cputime_adjust(), stime becomes greater than rtime due to a mul_u64_u64_div_u64() precision problem:

  before mul_u64_u64_div_u64():
    stime = 175136586720000, rtime = 135989749728000, utime = 1416780000
  after mul_u64_u64_div_u64():
    stime = 135989949653530

Unsigned wraparound occurs because rtime is less than stime:

  utime = rtime - stime = 135989749728000 - 135989949653530
        = -199925530 = (u64)18446744073709518790

Trigger conditions:

 1) the user task runs in kernel mode most of the time
 2) ARM64 architecture
 3) TICK_CPU_ACCOUNTING=y, CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set

Fix the mul_u64_u64_div_u64() precision problem by resetting stime to rtime when stime exceeds it.

Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
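In cputime_adjust() the fix plausibly reduces to a clamp after the scaling step. A sketch using the variable names from the report (assumption: this is the point where the precision loss bites):

  stime = mul_u64_u64_div_u64(stime, rtime, stime + utime);
  /*
   * Rounding in the 128-bit multiply/divide can leave stime a hair
   * above rtime; clamp so utime = rtime - stime cannot wrap to a
   * huge u64.
   */
  if (stime > rtime)
          stime = rtime;
  utime = rtime - stime;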
2024-07-29  locking/pvqspinlock: Correct the type of "old" variable in pv_kick_node()  (Uros Bizjak)
"enum vcpu_state" is not compatible with the "u8" type for all targets, resulting in:

  error: initialization of 'u8 *' {aka 'unsigned char *'} from
  incompatible pointer type 'enum vcpu_state *'

for LoongArch. Correct the type of the "old" variable to "u8".

Fixes: fea0e1820b51 ("locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h")
Closes: https://lore.kernel.org/lkml/20240719024010.3296488-1-maobibo@loongson.cn/
Reported-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20240721164552.50175-1-ubizjak@gmail.com
2024-07-29  Merge drm/drm-next into drm-misc-next  (Thomas Zimmermann)
Backmerging to get a late RC of v6.10 before moving into v6.11.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2024-07-29  locking/csd_lock: Print large numbers as negatives  (Paul E. McKenney)
The CSD-lock-hold diagnostics from CONFIG_CSD_LOCK_WAIT_DEBUG are printed in nanoseconds as unsigned long longs, which is a bit obtuse for human readers when timing bugs result in negative CSD-lock hold times. Yes, there are some people to whom it is immediately obvious that 18446744073709551615 is really -1, but for the rest of us...

Therefore, print these numbers as signed long longs, making the negative hold times immediately apparent.

Reported-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
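The change is a print-format tweak. A sketch (format string illustrative, not the literal one in the CSD-lock diagnostics):

  s64 ts_delta = ts2 - ts1;   /* can legitimately go negative on timing bugs */

  /* before: a -1 ns delta prints as 18446744073709551615 */
  pr_alert("csd: ... waited %llu ns", (unsigned long long)ts_delta);
  /* after: the same delta prints as -1 */
  pr_alert("csd: ... waited %lld ns", (long long)ts_delta);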
2024-07-29  rcu/kfree: Warn on unexpected tail state  (Paul E. McKenney)
Within the rcu_sr_normal_gp_cleanup_work() function, there is an acquire load from rcu_state.srs_done_tail, which is expected to be non-NULL. This commit adds a WARN_ON_ONCE() to check this expectation.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcutorture: Make rcu_torture_write_types() print number of update types  (Paul E. McKenney)
This commit follows the list of update types with their count, resulting in console output like this:

  rcu_torture_write_types: Testing conditional GPs.
  rcu_torture_write_types: Testing conditional expedited GPs.
  rcu_torture_write_types: Testing conditional full-state GPs.
  rcu_torture_write_types: Testing expedited GPs.
  rcu_torture_write_types: Testing asynchronous GPs.
  rcu_torture_write_types: Testing polling GPs.
  rcu_torture_write_types: Testing polling full-state GPs.
  rcu_torture_write_types: Testing polling expedited GPs.
  rcu_torture_write_types: Testing polling full-state expedited GPs.
  rcu_torture_write_types: Testing normal GPs.
  rcu_torture_write_types: Testing 10 update types

This commit adds the final line giving the count.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcutorture: Generic test for NUM_ACTIVE_*RCU_POLL*  (Paul E. McKenney)
The rcutorture test suite has specific tests for both of the NUM_ACTIVE_RCU_POLL_OLDSTATE and NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE macros provided for RCU polled grace periods. However, with the advent of NUM_ACTIVE_SRCU_POLL_OLDSTATE, a more generic test is needed. This commit therefore adds ->poll_active and ->poll_active_full fields to the rcu_torture_ops structure and converts the existing specific tests to use these fields, when present.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcutorture: Add SRCU ->same_gp_state and ->get_comp_state functions  (Paul E. McKenney)
This commit points the SRCU ->same_gp_state and ->get_comp_state fields to same_state_synchronize_srcu() and get_completed_synchronize_srcu(), allowing them to be tested.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcutorture: Remove redundant rcu_torture_ops get_gp_completed fields  (Paul E. McKenney)
The rcu_torture_ops structure's ->get_gp_completed and ->get_gp_completed_full fields are redundant with its ->get_comp_state and ->get_comp_state_full fields. This commit therefore removes the former in favor of the latter.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Remove SEGCBLIST_RCU_CORE  (Frederic Weisbecker)
RCU core can't be running anymore while in the middle of (de-)offloading, since this sort of transition now only applies to offline CPUs. The SEGCBLIST_RCU_CORE state can therefore be removed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Remove halfway (de-)offloading handling from rcu_core  (Frederic Weisbecker)
RCU core can't be running anymore while in the middle of (de-)offloading, since this sort of transition now only applies to offline CPUs. The locked callback acceleration handling during the transition can therefore be removed, along with concurrent batch execution.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Remove halfway (de-)offloading handling from rcu_core()'s QS reporting  (Frederic Weisbecker)
RCU core can't be running anymore while in the middle of (de-)offloading, since this sort of transition now only applies to offline CPUs. The locked callback acceleration handling during the transition can therefore be removed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Remove halfway (de-)offloading handling from bypass  (Frederic Weisbecker)
Bypass enqueue can't happen anymore in the middle of (de-)offloading, since this sort of transition now only applies to offline CPUs. The related safety check can therefore be removed.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: (De-)offload callbacks on offline CPUs only  (Frederic Weisbecker)
Currently callbacks can be (de-)offloaded only on online CPUs. This involves an overly elaborate state machine in order to make sure that callbacks are always handled during the process while ensuring synchronization between rcu_core and the NOCB kthreads.

The only potential user of NOCB (de-)offloading appears to be a nohz_full toggling interface through cpusets. And the general agreement is now to work toward toggling the nohz_full state on offline CPUs to simplify the whole picture.

Therefore, convert the (de-)offloading to only support offline CPUs. This involves the following changes:

* Call rcu_barrier() before deoffloading. An offline offloaded CPU may still carry callbacks in its queue ignored by rcutree_migrate_callbacks(). Those callbacks must all be flushed before switching to a regular queue because no more kthreads will handle those before the CPU ever gets re-onlined. This means that further calls to rcu_barrier() will find an empty queue until the CPU goes through rcutree_report_cpu_starting(). As a result it is guaranteed that further rcu_barrier() won't try to lock the nocb_lock for that target and thus won't risk an imbalance. Therefore barrier_mutex doesn't need to be locked anymore upon deoffloading.

* Assume the queue is empty before offloading, as rcutree_migrate_callbacks() took care of everything. This means that further calls to rcu_barrier() will find an empty queue until the CPU goes through rcutree_report_cpu_starting(). As a result it is guaranteed that further rcu_barrier() won't risk a nocb_lock imbalance. Therefore barrier_mutex doesn't need to be locked anymore upon offloading.

* No need to flush the bypass anymore.

Further simplifications will follow in upcoming patches.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Introduce nocb mutex  (Frederic Weisbecker)
The barrier_mutex is currently used to protect (de-)offloading operations and prevent nocb_lock locking imbalance in rcu_barrier() and the shrinker, and also misordered RCU barrier invocation.

Now, since RCU (de-)offloading is going to happen on offline CPUs, an RCU barrier will have to be executed while transitioning from the offloaded to the de-offloaded state. And this can't happen while holding the barrier_mutex.

Introduce a NOCB mutex to protect (de-)offloading transitions. The barrier_mutex is still held for now when necessary to avoid barrier-callback reordering and nocb_lock imbalance.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Assert no callbacks while nocb kthread allocation fails  (Frederic Weisbecker)
When a NOCB CPU fails to create a nocb kthread on bringup, the CPU is then deoffloaded. The barrier mutex is locked at this stage. It is typically used to protect against concurrent (de-)offloading and/or concurrent rcu_barrier() that would otherwise risk a nocb locking imbalance. However:

* rcu_barrier() can't run concurrently if it's the boot CPU on early boot-up.

* rcu_barrier() can run concurrently if it's a secondary CPU, but it is expected to see 0 callbacks on this target because it's the first time it boots.

* (De-)offloading can't happen concurrently with smp_init(), as rcutorture is initialized later, at least not before device_initcall(), and userspace isn't available yet.

* (De-)offloading can't happen concurrently with cpu_up(), courtesy of cpu_hotplug_lock.

But:

* The lazy shrinker might run concurrently with cpu_up(). It shouldn't try to grab the nocb_lock and risk an imbalance, as lazy_len is supposed to be 0, but be extra cautious.

* Also be cautious against potential subtleties of resume from hibernation.

So keep the locking and add some assertions and comments.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Move nocb field at the end of state struct  (Frederic Weisbecker)
nocb_is_setup is a rarely used field, mostly on boot and CPU hotplug. It shouldn't occupy the middle of the rcu state hot fields cacheline. Move it to the end and build it conditionally while at it. More cold NOCB fields are to come.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  rcu/nocb: Introduce RCU_NOCB_LOCKDEP_WARN()  (Frederic Weisbecker)
Checking for races against concurrent (de-)offloading implies the creation of !CONFIG_RCU_NOCB_CPU stubs to check if each relevant lock is held. For now this only implies the nocb_lock, but more are to be expected. Create instead a NOCB-specific version of RCU_LOCKDEP_WARN() to avoid the proliferation of stubs.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
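The helper's shape is straightforward. A sketch, assuming it simply wraps the existing RCU_LOCKDEP_WARN() and compiles away when CONFIG_RCU_NOCB_CPU is off:

  #ifdef CONFIG_RCU_NOCB_CPU
  #define RCU_NOCB_LOCKDEP_WARN(c, s)  RCU_LOCKDEP_WARN(c, s)
  #else
  #define RCU_NOCB_LOCKDEP_WARN(c, s)  do { } while (0)
  #endif

Since the !NOCB variant discards its arguments entirely, callers can reference NOCB-only locks in the condition without those symbols needing to exist in !NOCB builds.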
2024-07-29  context_tracking, rcu: Rename ct_dynticks_cpu_acquire() into ct_rcu_watching_cpu_acquire()  (Valentin Schneider)
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING; reflect that change in the related helpers.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  context_tracking, rcu: Rename ct_dynticks_cpu() into ct_rcu_watching_cpu()  (Valentin Schneider)
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING; reflect that change in the related helpers.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  context_tracking, rcu: Rename ct_dynticks() into ct_rcu_watching()  (Valentin Schneider)
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING; reflect that change in the related helpers.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  context_tracking, rcu: Rename RCU_DYNTICKS_IDX into CT_RCU_WATCHING  (Valentin Schneider)
The symbols relating to the CT_STATE part of context_tracking.state are now all prefixed with CT_STATE. The RCU dynticks counter part of that atomic variable still involves symbols with different prefixes; align them all to be prefixed with CT_RCU_WATCHING.

Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-29  treewide: context_tracking: Rename CONTEXT_* into CT_STATE_*  (Valentin Schneider)
Context tracking state related symbols currently use a mix of the CONTEXT_ (e.g. CONTEXT_KERNEL) and CT_STATE_ (e.g. CT_STATE_MASK) prefixes. Clean up the naming and make the ctx_state enum use the CT_STATE_ prefix.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-07-28  minmax: make generic MIN() and MAX() macros available everywhere  (Linus Torvalds)
This just standardizes the use of the MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros.

These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches:

 - trivial duplicated macro definitions have been removed

   Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically.

 - non-trivial duplicated macro definitions are guarded with #ifndef

   This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case.

 - strange use case #1

   A couple of drivers decided that the way they want to describe their versioning is with

        #define MAJ 1
        #define MIN 2
        #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)

   which adds zero value, and I just did my Alexander the Great impersonation and rewrote that pointless Gordian knot as

        #define DRV_VERSION "1.2"

   instead.

 - strange use case #2

   A couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were rewritten as C enums instead. The new function-like macros only expand when followed by an open parenthesis, and thus don't clash with enum use.

Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone.

Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
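With the very traditional semantics, the generic versions amount to the following (sketch; the in-tree definitions in include/linux/minmax.h may carry extra guards):

  #define MIN(a, b) ((a) < (b) ? (a) : (b))
  #define MAX(a, b) ((a) > (b) ? (a) : (b))

  /* usable where the type-checking min()/max() can't go, e.g. static
   * initializers and other constant expressions (sizes illustrative):
   */
  static char scratch[MAX(128, 256)];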
2024-07-27  Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull timer migration updates from Thomas Gleixner:

 "Fixes and minor updates for the timer migration code:

  - Stop testing the group->parent pointer as it is not guaranteed to be stable over a chain of operations by design. This includes a warning which would be nice to have but it produces false positives due to the racy nature of the check.

  - Plug a race between CPUs going in and out of idle and a CPU hotplug operation. The latter can create and connect a new hierarchy level which is missed in the concurrent updates of CPUs which go into idle. As a result the events of such a CPU might not be processed and timers go stale.

    Cure it by splitting the hotplug operation into a prepare and online callback. The prepare callback is guaranteed to run on an online and therefore active CPU. This CPU updates the hierarchy and being online ensures that there is always at least one migrator active which handles the modified hierarchy correctly when going idle. The online callback which runs on the incoming CPU then just marks the CPU active and brings it into operation.

  - Improve tracing and polish the code further so it is more obvious what's going on"

* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers/migration: Fix grammar in comment
  timers/migration: Spare write when nothing changed
  timers/migration: Rename childmask by groupmask to make naming more obvious
  timers/migration: Read childmask and parent pointer in a single place
  timers/migration: Use a single struct for hierarchy walk data
  timers/migration: Improve tracing
  timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  timers/migration: Do not rely always on group->parent
2024-07-25  Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from Jakub Kicinski:

 "Including fixes from bpf and netfilter.

  A lot of networking people were at a conference last week, busy catching COVID, so relatively short PR.

  Current release - regressions:

   - tcp: process the 3rd ACK with sk_socket for TFO and MPTCP

  Current release - new code bugs:

   - l2tp: protect session IDR and tunnel session list with one lock, make sure the state is coherent to avoid a warning

   - eth: bnxt_en: update xdp_rxq_info in queue restart logic

   - eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field

  Previous releases - regressions:

   - xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len, the field reuses previously un-validated pad

  Previous releases - always broken:

   - tap/tun: drop short frames to prevent crashes later in the stack

   - eth: ice: add a per-VF limit on number of FDIR filters

   - af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"

* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
  tun: add missing verification for short frame
  tap: add missing verification for short frame
  mISDN: Fix a use after free in hfcmulti_tx()
  gve: Fix an edge case for TSO skb validity check
  bnxt_en: update xdp_rxq_info in queue restart logic
  tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
  MAINTAINERS: make Breno the netconsole maintainer
  MAINTAINERS: Update bonding entry
  net: nexthop: Initialize all fields in dumped nexthops
  net: stmmac: Correct byte order of perfect_match
  selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
  tipc: Return non-zero value from tipc_udp_addr2str() on error
  netfilter: nft_set_pipapo_avx2: disable softinterrupts
  ice: Fix recipe read procedure
  ice: Add a per-VF limit on number of FDIR filters
  net: bonding: correctly annotate RCU in bond_should_notify_peers()
  ...
2024-07-25  Merge tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux  (Linus Torvalds)
Pull printk updates from Petr Mladek:

 - trivial printk changes

   The bigger "real" printk work is still being discussed.

* tag 'printk-for-6.11-trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  vsprintf: add missing MODULE_DESCRIPTION() macro
  printk: Rename console_replay_all() and update context
2024-07-25  Merge tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl  (Linus Torvalds)
Pull sysctl constification from Joel Granados:

 "Treewide constification of the ctl_table argument of proc_handlers using a coccinelle script and some manual code formatting fixups.

  This is a prerequisite to moving the static ctl_table structs into read-only data section which will ensure that proc_handler function pointers cannot be modified"

* tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: treewide: constify the ctl_table argument of proc_handlers
2024-07-25  Merge tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux  (Linus Torvalds)
Pull kgdb updates from Daniel Thompson:

 "Three small changes this cycle:

  - Clean up an architecture abstraction that is no longer needed because all the architectures have converged.

  - Actually use the prompt argument to kdb_position_cursor() instead of ignoring it (functionally this fix is a nop but that was due to luck rather than good judgement)

  - Fix a -Wformat-security warning"

* tag 'kgdb-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kdb: Get rid of redundant kdb_curr_task()
  kdb: Use the passed prompt in kdb_position_cursor()
  kdb: address -Wformat-security warnings
2024-07-25  Merge tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping  (Linus Torvalds)
Pull dma-mapping fix from Christoph Hellwig:

 - fix the order of actions in dmam_free_coherent (Lance Richardson)

* tag 'dma-mapping-6.11-2024-07-24' of git://git.infradead.org/users/hch/dma-mapping:
  dma: fix call order in dmam_free_coherent