summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-08-17sched: Teach dequeue_task() about special task statesPeter Zijlstra
Since special task states must not suffer spurious wakeups, and the proposed delayed dequeue can cause exactly these (under some boundary conditions), propagate this knowledge into dequeue_task() such that it can do the right thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105030.110439521@infradead.org
2024-08-17sched,freezer: Mark TASK_FROZEN specialPeter Zijlstra
The special task states are those that do not suffer spurious wakeups, TASK_FROZEN is very much one of those, mark it as such. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.998329901@infradead.org
2024-08-17sched/fair: Implement ENQUEUE_DELAYEDPeter Zijlstra
Doing a wakeup on a delayed dequeue task is about as simple as it sounds -- remove the delayed mark and enjoy the fact it was actually still on the runqueue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.888107381@infradead.org
2024-08-17sched/fair: Prepare pick_next_task() for delayed dequeuePeter Zijlstra
Delayed dequeue's natural end is when it gets picked again. Ensure pick_next_task() knows what to do with delayed tasks. Note, this relies on the earlier patch that made pick_next_task() state invariant -- it will restart the pick on dequeue, because obviously the just dequeued task is no longer eligible. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.747330118@infradead.org
2024-08-17sched/fair: Prepare exit/cleanup paths for delayed_dequeuePeter Zijlstra
When dequeue_task() is delayed it becomes possible to exit a task (or cgroup) that is still enqueued. Ensure things are dequeued before freeing. Thanks to Valentin for asking the obvious questions and making switched_from_fair() less weird. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.631948434@infradead.org
2024-08-17sched/fair: Assert {set_next,put_prev}_entity() are properly balancedPeter Zijlstra
Just a little sanity test.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.486423066@infradead.org
2024-08-17sched/uclamg: Handle delayed dequeuePeter Zijlstra
Delayed dequeue has tasks sit around on the runqueue that are not actually runnable -- specifically, they will be dequeued the moment they get picked. One side-effect is that such a task can get migrated, which leads to a 'nested' dequeue_task() scenario that messes up uclamp if we don't take care. Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on the runqueue. This however will have removed the task from uclamp -- per uclamp_rq_dec() in dequeue_task(). So far so good. However, if at that point the task gets migrated -- or nice adjusted or any of a myriad of operations that does a dequeue-enqueue cycle -- we'll pass through dequeue_task()/enqueue_task() again. Without modification this will lead to a double decrement for uclamp, which is wrong. Reported-by: Luis Machado <luis.machado@arm.com> Reported-by: Hongyan Xia <hongyan.xia2@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.315205425@infradead.org
2024-08-17sched: Prepare generic code for delayed dequeuePeter Zijlstra
While most of the delayed dequeue code can be done inside the sched_class itself, there is one location where we do not have an appropriate hook, namely ttwu_runnable(). Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking delayed dequeue tasks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.200000445@infradead.org
2024-08-17sched: Split DEQUEUE_SLEEP from deactivate_task()Peter Zijlstra
As a preparation for dequeue_task() failing, and a second code-path needing to take care of the 'success' path, split out the DEQEUE_SLEEP path from deactivate_task(). Much thanks to Libo for spotting and fixing a TASK_ON_RQ_MIGRATING ordering fail. Fixed-by: Libo Chen <libo.chen@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105029.086192709@infradead.org
2024-08-17sched/fair: Re-organize dequeue_task_fair()Peter Zijlstra
Working towards delaying dequeue, notably also inside the hierachy, rework dequeue_task_fair() such that it can 'resume' an interrupted hierarchy walk. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.977256873@infradead.org
2024-08-17sched: Allow sched_class::dequeue_task() to failPeter Zijlstra
Change the function signature of sched_class::dequeue_task() to return a boolean, allowing future patches to 'fail' dequeue. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.864630153@infradead.org
2024-08-17sched/fair: Unify pick_{,next_}_task_fair()Peter Zijlstra
Implement pick_next_task_fair() in terms of pick_task_fair() to de-duplicate the pick loop. More importantly, this makes all the pick loops use the state-invariant form, which is useful to introduce further re-try conditions in later patches. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.725062368@infradead.org
2024-08-17sched/fair: Cleanup pick_task_fair()'s currPeter Zijlstra
With 4c456c9ad334 ("sched/fair: Remove unused 'curr' argument from pick_next_entity()") curr is no longer being used, so no point in clearing it. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.614707623@infradead.org
2024-08-17sched/fair: Cleanup pick_task_fair() vs throttlePeter Zijlstra
Per 54d27365cae8 ("sched/fair: Prevent throttling in early pick_next_task_fair()") the reason check_cfs_rq_runtime() is under the 'if (curr)' check is to ensure the (downward) traversal does not result in an empty cfs_rq. But then the pick_task_fair() 'copy' of all this made it restart the traversal anyway, so that seems to solve the issue too. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.501679876@infradead.org
2024-08-17sched/eevdf: Remove min_vruntime_copyPeter Zijlstra
Since commit e8f331bcc270 ("sched/smp: Use lag to simplify cross-runqueue placement") the min_vruntime_copy is no longer used. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.395297941@infradead.org
2024-08-17sched/eevdf: Add feature commentsPeter Zijlstra
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Valentin Schneider <vschneid@redhat.com> Link: https://lkml.kernel.org/r/20240727105028.287790895@infradead.org
2024-08-17padata: use integer wrap around to prevent deadlock on seq_nr overflowVanGiang Nguyen
When submitting more than 2^32 padata objects to padata_do_serial, the current sorting implementation incorrectly sorts padata objects with overflowed seq_nr, causing them to be placed before existing objects in the reorder list. This leads to a deadlock in the serialization process as padata_find_next cannot match padata->seq_nr and pd->processed because the padata instance with overflowed seq_nr will be selected next. To fix this, we use an unsigned integer wrap around to correctly sort padata objects in scenarios with integer overflow. Fixes: bfde23ce200e ("padata: unbind parallel jobs from specific CPUs") Cc: <stable@vger.kernel.org> Co-developed-by: Christian Gafert <christian.gafert@rohde-schwarz.com> Signed-off-by: Christian Gafert <christian.gafert@rohde-schwarz.com> Co-developed-by: Max Ferger <max.ferger@rohde-schwarz.com> Signed-off-by: Max Ferger <max.ferger@rohde-schwarz.com> Signed-off-by: Van Giang Nguyen <vangiang.nguyen@rohde-schwarz.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-16Merge tag 'trace-v6.11-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "A couple of fixes for tracing: - Prevent a NULL pointer dereference in the error path of RTLA tool - Fix an infinite loop bug when reading from the ring buffer when closed. If there's a thread trying to read the ring buffer and it gets closed by another thread, the one reading will go into an infinite loop when the buffer is empty instead of exiting back to user space" * tag 'trace-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rtla/osnoise: Prevent NULL dereference in error handling tracing: Return from tracing_buffers_read() if the file has been closed
2024-08-15crash: fix riscv64 crash memory reserve dead loopJinjie Ruan
On RISCV64 Qemu machine with 512MB memory, cmdline "crashkernel=500M,high" will cause system stall as below: Zone ranges: DMA32 [mem 0x0000000080000000-0x000000009fffffff] Normal empty Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000080000000-0x000000008005ffff] node 0: [mem 0x0000000080060000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff] (stall here) commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop bug") fix this on 32-bit architecture. However, the problem is not completely solved. If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on 64-bit architecture, for example, when system memory is equal to CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also occur: -> reserve_crashkernel_generic() and high is true -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX). As Catalin suggested, do not remove the ",high" reservation fallback to ",low" logic which will change arm64's kdump behavior, but fix it by skipping the above situation similar to commit d2f32f23190b ("crash: fix x86_32 crash memory reserve dead loop"). After this patch, it print: cannot allocate crashkernel (size:0x1f400000) Link: https://lkml.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Dave Young <dyoung@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15bpf: Remove __btf_name_valid() and change to btf_name_valid_identifier()Jeongjun Park
__btf_name_valid() can be completely replaced with btf_name_valid_identifier, and since most of the time you already call btf_name_valid_identifier instead of __btf_name_valid , it would be appropriate to rename the __btf_name_valid function to btf_name_valid_identifier and remove __btf_name_valid. Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/bpf/20240807143110.181497-1-aha310510@gmail.com
2024-08-15Merge tag 'hardening-v6.11-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement (Thorsten Blum) - kallsyms: Clean up interaction with LTO suffixes (Song Liu) - refcount: Report UAF for refcount_sub_and_test(0) when counter==0 (Petr Pavlu) - kunit/overflow: Avoid misallocation of driver name (Ivan Orlov) * tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kallsyms: Match symbols exactly with CONFIG_LTO_CLANG kallsyms: Do not cleanup .llvm.<hash> suffix before sorting symbols kunit/overflow: Fix UB in overflow_allocation_test gcc-plugins: randstruct: Remove GCC 4.7 or newer requirement refcount: Report UAF for refcount_sub_and_test(0) when counter==0
2024-08-15kallsyms: Match symbols exactly with CONFIG_LTO_CLANGSong Liu
With CONFIG_LTO_CLANG=y, the compiler may add .llvm.<hash> suffix to function names to avoid duplication. APIs like kallsyms_lookup_name() and kallsyms_on_each_match_symbol() tries to match these symbol names without the .llvm.<hash> suffix, e.g., match "c_stop" with symbol c_stop.llvm.17132674095431275852. This turned out to be problematic for use cases that require exact match, for example, livepatch. Fix this by making the APIs to match symbols exactly. Also cleanup kallsyms_selftests accordingly. Signed-off-by: Song Liu <song@kernel.org> Fixes: 8cc32a9bbf29 ("kallsyms: strip LTO-only suffixes from promoted global functions") Tested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20240807220513.3100483-3-song@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-15context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watchingValentin Schneider
The "rcu_dyntick" naming convention has been turned into "rcu_watching" for all helpers now, align the trace event to that. To add to the confusion, the strings passed to the trace event are now reversed: when RCU "starts" the dyntick / EQS state, it "stops" watching. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, replace "dyntick_idle" into "eqs" to drop the dyntick reference. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, drop the dyntick reference and update the name of this helper to express that it rechecks rdp->watching_snap after an earlier rcu_watching_snap_save(). Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename dyntick_save_progress_counter() into rcu_watching_snap_save()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any meaning. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename struct rcu_data .exp_dynticks_snap into .exp_watching_snapValentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the snapshot helpers are now prefix by "rcu_watching". Reflect that change into the storage variables for these snapshots. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename struct rcu_data .dynticks_snap into .watching_snapValentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the snapshot helpers are now prefix by "rcu_watching". Reflect that change into the storage variables for these snapshots. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, reflect that change in the related helpers. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename rcu_dynticks_in_eqs_since() into rcu_watching_snap_stopped_since()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, the dynticks prefix can go. While at it, this helper is only meant to be called after failing an earlier call to rcu_watching_snap_in_eqs(), document this in the comments and add a WARN_ON_ONCE() for good measure. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename rcu_dynticks_in_eqs() into rcu_watching_snap_in_eqs()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, reflect that change in the related helpers. While at it, update a comment that still refers to rcu_dynticks_snap(), which was removed by commit: 7be2e6323b9b ("rcu: Remove full memory barrier on RCU stall printout") Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Rename rcu_dynticks_eqs_online() into rcu_watching_online()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, reflect that change in the related helpers. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15context_tracking, rcu: Rename rcu_dynticks_curr_cpu_in_eqs() into ↵Valentin Schneider
rcu_is_watching_curr_cpu() The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, reflect that change in the related helpers. Note that "watching" is the opposite of "in EQS", so the negation is lifted out of the helper and into the callsites. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15context_tracking, rcu: Rename rcu_dynticks_task*() into rcu_task*()Valentin Schneider
The context_tracking.state RCU_DYNTICKS subvariable has been renamed to RCU_WATCHING, and the 'dynticks' prefix can be dropped without losing any meaning. While at it, flip the suffixes of these helpers. We are not telling that we are entering dynticks mode from an RCU-task perspective anymore; we are telling that we are exiting RCU-tasks because we are in eqs mode. [ neeraj.upadhyay: Incorporate Frederic Weisbecker feedback. ] Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rtmutex: Drop rt_mutex::wait_lock before schedulingRoland Xu
rt_mutex_handle_deadlock() is called with rt_mutex::wait_lock held. In the good case it returns with the lock held and in the deadlock case it emits a warning and goes into an endless scheduling loop with the lock held, which triggers the 'scheduling in atomic' warning. Unlock rt_mutex::wait_lock in the dead lock case before issuing the warning and dropping into the schedule for ever loop. [ tglx: Moved unlock before the WARN(), removed the pointless comment, massaged changelog, added Fixes tag ] Fixes: 3d5c9340d194 ("rtmutex: Handle deadlock detection smarter") Signed-off-by: Roland Xu <mu001999@outlook.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/ME0P300MB063599BEF0743B8FA339C2CECC802@ME0P300MB0635.AUSP300.PROD.OUTLOOK.COM
2024-08-15tracing/fgraph: Have fgraph handle previous boot function addressesSteven Rostedt
Update the function graph code to modify the function addresses for a previous boot buffer so that it matches the current kallsyms (note this does not handle module addresses, yet). After a reboot, instead of seeing: # trace-cmd show -B boot_mapped | tail -n30 swapper/0-1 [000] d..2. 56.286470: 0) 0.481 us | 0xffffffff925da5c4(); swapper/0-1 [000] d.... 56.286471: 0) 4.065 us | } swapper/0-1 [000] d.... 56.286471: 0) 4.920 us | } swapper/0-1 [000] d..1. 56.286472: 0) | 0xffffffff92536254() { swapper/0-1 [000] d..1. 56.286472: 0) + 28.974 us | 0xffffffff92534e30(); swapper/0-1 [000] d.... 56.286516: 0) + 43.881 us | } swapper/0-1 [000] d..1. 56.286517: 0) | 0xffffffff925136c4() { swapper/0-1 [000] d..1. 56.286518: 0) | 0xffffffff92514a14() { swapper/0-1 [000] d..1. 56.286518: 0) 6.003 us | 0xffffffff92514200(); swapper/0-1 [000] d.... 56.286529: 0) + 11.510 us | } swapper/0-1 [000] d.... 56.286529: 0) + 12.895 us | } swapper/0-1 [000] d.... 56.286530: 0) ! 382.884 us | } swapper/0-1 [000] d..1. 56.286530: 0) | 0xffffffff92536444() { swapper/0-1 [000] d..1. 56.286531: 0) | 0xffffffff92536254() { swapper/0-1 [000] d..1. 56.286531: 0) + 26.335 us | 0xffffffff92534e30(); swapper/0-1 [000] d.... 56.286560: 0) + 29.511 us | } swapper/0-1 [000] d.... 56.286561: 0) + 30.452 us | } swapper/0-1 [000] d..1. 56.286562: 0) | 0xffffffff9253c014() { swapper/0-1 [000] d..1. 56.286562: 0) | 0xffffffff9253bed4() { swapper/0-1 [000] d..1. 56.286563: 0) + 13.465 us | 0xffffffff92536684(); swapper/0-1 [000] d.... 56.286577: 0) + 14.651 us | } swapper/0-1 [000] d.... 56.286577: 0) + 15.821 us | } swapper/0-1 [000] d..1. 56.286578: 0) 0.667 us | 0xffffffff92547074(); swapper/0-1 [000] d..1. 56.286579: 0) 0.453 us | 0xffffffff924f35c4(); swapper/0-1 [000] d.... 56.286580: 0) # 3906.348 us | } swapper/0-1 [000] d..1. 56.286581: 0) | 0xffffffff92531a14() { swapper/0-1 [000] d..1. 56.286581: 0) 0.518 us | 0xffffffff92505cb4(); swapper/0-1 [000] d..1. 56.286595: 0) | 0xffffffff92db83c4() { swapper/0-1 [000] d..1. 56.286596: 0) | 0xffffffff92dec2e4() { swapper/0-1 [000] d..1. 56.286597: 0) | 0xffffffff92db5304() { It now shows: # trace-cmd show -B boot_mapped | tail -n30 swapper/0-1 [000] d..2. 363.079099: 0) 0.483 us | preempt_count_sub(); swapper/0-1 [000] d.... 363.079100: 0) 4.112 us | } swapper/0-1 [000] d.... 363.079101: 0) 4.979 us | } swapper/0-1 [000] d..1. 363.079101: 0) | disable_local_APIC() { swapper/0-1 [000] d..1. 363.079102: 0) + 29.153 us | clear_local_APIC.part.0(); swapper/0-1 [000] d.... 363.079148: 0) + 46.517 us | } swapper/0-1 [000] d..1. 363.079149: 0) | mcheck_cpu_clear() { swapper/0-1 [000] d..1. 363.079149: 0) | mce_intel_feature_clear() { swapper/0-1 [000] d..1. 363.079150: 0) 5.871 us | lmce_supported(); swapper/0-1 [000] d.... 363.079161: 0) + 11.340 us | } swapper/0-1 [000] d.... 363.079161: 0) + 12.638 us | } swapper/0-1 [000] d.... 363.079162: 0) ! 383.518 us | } swapper/0-1 [000] d..1. 363.079162: 0) | lapic_shutdown() { swapper/0-1 [000] d..1. 363.079163: 0) | disable_local_APIC() { swapper/0-1 [000] d..1. 363.079163: 0) + 26.144 us | clear_local_APIC.part.0(); swapper/0-1 [000] d.... 363.079192: 0) + 29.424 us | } swapper/0-1 [000] d.... 363.079192: 0) + 30.376 us | } swapper/0-1 [000] d..1. 363.079193: 0) | restore_boot_irq_mode() { swapper/0-1 [000] d..1. 363.079194: 0) | native_restore_boot_irq_mode() { swapper/0-1 [000] d..1. 363.079194: 0) + 13.863 us | disconnect_bsp_APIC(); swapper/0-1 [000] d.... 363.079209: 0) + 14.933 us | } swapper/0-1 [000] d.... 363.079209: 0) + 16.009 us | } swapper/0-1 [000] d..1. 363.079210: 0) 0.694 us | hpet_disable(); swapper/0-1 [000] d..1. 363.079211: 0) 0.511 us | iommu_shutdown_noop(); swapper/0-1 [000] d.... 363.079212: 0) # 3980.260 us | } swapper/0-1 [000] d..1. 363.079212: 0) | native_machine_emergency_restart() { swapper/0-1 [000] d..1. 363.079213: 0) 0.495 us | tboot_shutdown(); swapper/0-1 [000] d..1. 363.079230: 0) | acpi_reboot() { swapper/0-1 [000] d..1. 363.079231: 0) | acpi_reset() { swapper/0-1 [000] d..1. 363.079232: 0) | acpi_os_write_port() { Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20240813171257.478901820@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-15tracing: Allow boot instances to use reserve_mem boot memorySteven Rostedt (Google)
Allow boot instances to use memory reserved by the reserve_mem boot option. reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace The above will allocate 12 megs with 4096 alignment and label it "trace". The second parameter will create a "boot_mapped" instance and use the memory reserved and labeled as "trace" as the memory for the ring buffer. That will create an instance called "boot_mapped": /sys/kernel/tracing/instances/boot_mapped Note, because the ring buffer is using a defined memory ranged, it will act just like a memory mapped ring buffer. It will not have a snapshot buffer, as it can't swap out the buffer. The snapshot files as well as any tracers that uses a snapshot will not be present in the boot_mapped instance. Also note that reserve_mem is not reliable in acquiring the same physical memory at each soft reboot. It is possible that KALSR could map the kernel at the previous boot memory location forcing the reserve_mem to return a different memory location. In this case, the previous ring buffer will be lost. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20240815082811.669f7d8c@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-14tracing: Fix ifdef of snapshots to not prevent last_boot_info fileSteven Rostedt
The mapping of the ring buffer to memory allocated at boot up will also expose a "last_boot_info" to help tooling to read the raw data from the last boot. As instances that have their ring buffer mapped to fixed memory cannot perform snapshots, they can either have the "snapshot" file or the "last_boot_info" file, but not both. The code that added the "last_boot_info" file failed to notice that the "snapshot" creation was inside a "#ifdef CONFIG_TRACER_SNAPSHOT" and incorrectly placed the creation of the "last_boot_info" file within the ifdef block. Not only does it cause a warning when CONFIG_TRACER_SNAPSHOT is not enabled, it also incorrectly prevents the file from appearing. Link: https://lore.kernel.org/all/20240719102640.718554-1-arnd@kernel.org/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reported-by: Arnd Bergmann <arnd@kernel.org> Link: https://lore.kernel.org/20240719101312.3d4ac707@rorschach.local.home Fixes: 7a1d1e4b9639 ("tracing/ring-buffer: Add last_boot_info file to boot instance") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-14Merge tag 'v6.11-rc3' into trace/ring-buffer/coreSteven Rostedt
The "reserve_mem" kernel command line parameter has been pulled into v6.11. Merge the latest -rc3 to allow the persistent ring buffer memory to be able to be mapped at the address specified by the "reserve_mem" command line parameter. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-08-15refscale: Constify struct ref_scale_opsChristophe JAILLET
'struct ref_scale_ops' are not modified in these drivers. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 34231 4167 736 39134 98de kernel/rcu/refscale.o After: ===== text data bss dec hex filename 35175 3239 736 39150 98ee kernel/rcu/refscale.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Tested-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcuscale: Count outstanding callbacks per-task rather than per-CPUPaul E. McKenney
The current rcu_scale_writer() asynchronous grace-period testing uses a per-CPU counter to track the number of outstanding callbacks. This is subject to CPU-imbalance errors when tasks migrate from one CPU to another between the time that the counter is incremented and the callback is queued, and additionally in kernels configured such that callbacks can be invoked on some CPU other than the one that queued it. This commit therefore arranges for per-task callback counts, thus avoiding any issues with migration of either tasks or callbacks. Reported-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcuscale: NULL out top-level pointers to heap memoryPaul E. McKenney
Currently, if someone modprobes and rmmods rcuscale successfully, but the next run errors out during the modprobe, non-NULL pointers to freed memory will remain. If the run after that also errors out during the modprobe, there will be double-free bugs. This commit therefore NULLs out top-level pointers to memory that has just been freed. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcuscale: Use special allocator for rcu_scale_writer()Paul E. McKenney
The rcu_scale_writer() function needs only a fixed number of rcu_head structures per kthread, which means that a trivial allocator suffices. This commit therefore uses an llist-based allocator using a fixed array of structures per kthread. This allows aggressive testing of RCU performance without stressing the slab allocators. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcuscale: Make rcu_scale_writer() tolerate repeated GFP_KERNEL failurePaul E. McKenney
Under some conditions, kmalloc(GFP_KERNEL) allocations have been observed to repeatedly fail. This situation has been observed to cause one of the rcu_scale_writer() instances to loop indefinitely retrying memory allocation for an asynchronous grace-period primitive. The problem is that if memory is short, all the other instances will allocate all available memory before the looping task is awakened from its rcu_barrier*() call. This in turn results in hangs, so that rcuscale fails to complete. This commit therefore removes the tight retry loop, so that when this condition occurs, the affected task is still passing through the full loop with its full set of termination checks. This spreads the risk of indefinite memory-allocation retry failures across all instances of rcu_scale_writer() tasks, which in turn prevents the hangs. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcuscale: Make all writer tasks report upon hangPaul E. McKenney
This commit causes all writer tasks to provide a brief report after a hang has been reported, spaced at one-second intervals. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcuscale: Provide clear error when async specified without primitivesPaul E. McKenney
Currently, if the rcuscale module's async module parameter is specified for RCU implementations that do not have async primitives such as RCU Tasks Rude (which now lacks a call_rcu_tasks_rude() function), there will be a series of splats due to calls to a NULL pointer. This commit therefore warns of this situation, but switches to non-async testing. Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Let dump_cpu_task() be used without preemption disabledRyo Takakura
The commit 2d7f00b2f0130 ("rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait()") disabled preemption around dump_cpu_task() to suppress warning on its usage within preemtible context. Calling dump_cpu_task() doesn't required to be in non-preemptible context except for suppressing the smp_processor_id() warning. As the smp_processor_id() is evaluated along with in_hardirq() to check if it's in interrupt context, this patch removes the need for its preemtion disablement by reordering the condition so that smp_processor_id() only gets evaluated when it's in interrupt context. Signed-off-by: Ryo Takakura <takakura@valinux.co.jp> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Summarize expedited RCU CPU stall warnings during CSD-lock stallsPaul E. McKenney
During CSD-lock stalls, the additional information output by expedited RCU CPU stall warnings is usually redundant, flooding the console for not good reason. However, this has been the way things work for a few years. This commit therefore uses rcutree.csd_lock_suppress_rcu_stall kernel boot parameter that causes expedited RCU CPU stall warnings to be abbreviated to a single line when there is at least one CPU that has been stuck waiting for CSD lock for more than five seconds. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Extract synchronize_rcu_expedited_stall() from ↵Paul E. McKenney
synchronize_rcu_expedited_wait() This commit extracts the RCU CPU stall-warning report code from synchronize_rcu_expedited_wait() and places it in a new function named synchronize_rcu_expedited_stall(). This is strictly a code-movement commit. A later commit will use this reorganization to avoid printing expedited RCU CPU stall warnings while there are ongoing CSD-lock stall reports. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-08-15rcu: Summarize RCU CPU stall warnings during CSD-lock stallsPaul E. McKenney
During CSD-lock stalls, the additional information output by RCU CPU stall warnings is usually redundant, flooding the console for not good reason. However, this has been the way things work for a few years. This commit therefore adds an rcutree.csd_lock_suppress_rcu_stall kernel boot parameter that causes RCU CPU stall warnings to be abbreviated to a single line when there is at least one CPU that has been stuck waiting for CSD lock for more than five seconds. To make this abbreviated message happen with decent probability: tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 8 \ --configs "2*TREE01" --kconfig "CONFIG_CSD_LOCK_WAIT_DEBUG=y" \ --bootargs "csdlock_debug=1 rcutorture.stall_cpu=200 \ rcutorture.stall_cpu_holdoff=120 rcutorture.stall_cpu_irqsoff=1 \ rcutree.csd_lock_suppress_rcu_stall=1 \ rcupdate.rcu_exp_cpu_stall_timeout=5000" --trust-make [ paulmck: Apply kernel test robot feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>