Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for the timer/clocksource code:
- The recent fix to make the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed in testing as some compilers hoist the access
into the non-preemotible section where the pointer is actually
used, but obviously compilers can rightfully invoke it where the
code put it.
Move it into the non-preemptible section right to the actual usage
side to cure it.
- The clocksource watchdog is supposed to emit a warning when the
retry count is greater than one and the number of retries reaches
the limit.
The condition is backwards and warns always when the count is
greater than one. Fixup the condition to prevent spamming dmesg"
* tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
tick/broadcast: Move per CPU pointer access into the atomic section
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
- When stime is larger than rtime due to accounting imprecision, then
utime = rtime - stime becomes negative. As this is unsigned math, the
result becomes a huge positive number.
Cure it by resetting stime to rtime in that case, so utime becomes 0.
- Restore consistent state when sched_cpu_deactivate() fails.
When offlining a CPU fails in sched_cpu_deactivate() after the SMT
present counter has been decremented, then the function aborts but
fails to increment the SMT present counter and leaves it imbalanced.
Consecutive operations cause it to underflow. Add the missing fixup
for the error path.
For SMT accounting the runqueue needs to marked online again in the
error exit path to restore consistent state.
* tag 'sched-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()
sched/core: Introduce sched_set_rq_on/offline() helper
sched/smt: Fix unbalance sched_smt_present dec/inc
sched/smt: Introduce sched_smt_present_inc/dec() helper
sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two fixes for locking and jump labels:
- Ensure that the atomic_cmpxchg() conditions are correct and
evaluating to true on any non-zero value except 1. The missing
check of the return value leads to inconsisted state of the jump
label counter.
- Add a missing type conversion in the paravirt spinlock code which
makes loongson build again"
* tag 'locking-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jump_label: Fix the fix, brown paper bags galore
locking/pvqspinlock: Correct the type of "old" variable in pv_kick_node()
|
|
Right now, cs_watchdog_read() does clocksource sanity checks based
on WATCHDOG_MAX_SKEW, which sets a floor on any clocksource's
.uncertainty_margin. These sanity checks can therefore act
inappropriately for clocksources with large uncertainty margins.
One reason for a clocksource to have a large .uncertainty_margin is when
that clocksource has long read-out latency, given that it does not make
sense for the .uncertainty_margin to be smaller than the read-out latency.
With the current checks, cs_watchdog_read() could reject all normal
reads from a clocksource with long read-out latencies, such as those
from legacy clocksources that are no longer implemented in hardware.
Therefore, recast the cs_watchdog_read() checks in terms of the
.uncertainty_margin values of the clocksources involved in the timespan in
question. The first covers two watchdog reads and one cs read, so use
twice the watchdog .uncertainty_margin plus that of the cs. The second
covers only a pair of watchdog reads, so use twice the watchdog
.uncertainty_margin.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-4-paulmck@kernel.org
|
|
The WATCHDOG_THRESHOLD macro is no longer used to supply a default value
for ->uncertainty_margin, but WATCHDOG_MAX_SKEW now is.
Therefore, update the comments to reflect this change.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-3-paulmck@kernel.org
|
|
Add more detail on the rationale for bounding the clocksource
->uncertainty_margin below at about 500ppm.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-1-paulmck@kernel.org
|
|
The current "nretries > 1 || nretries >= max_retries" check in
cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if
nretries is greater than 1. The intent is instead to never warn on the
first try, but otherwise warn if the successful retry was the last retry.
Therefore, change that "||" to "&&".
Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org
|
|
CONFIG_HOTPLUG_PARALLEL expects the architecture to implement
arch_cpuhp_init_parallel_bringup() to decide whether paralllel hotplug is
possible and to do the necessary architecture specific initialization.
There are architectures which can enable it unconditionally and do not
require architecture specific initialization.
Provide a weak fallback for arch_cpuhp_init_parallel_bringup() so that
such architectures are not forced to implement empty stub functions.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240716-loongarch-hotplug-v3-2-af59b3bb35c8@flygoat.com
|
|
Provide stub functions for SMT related parallel bring up functions so that
HOTPLUG_PARALLEL can work without HOTPLUG_SMT.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240716-loongarch-hotplug-v3-1-af59b3bb35c8@flygoat.com
|
|
As Documentation/filesystems/sysfs.rst suggested, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
to user space.
No functional change intended.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/20240801083156.2513508-3-luoxueqin@kylinos.cn
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
As Documentation/filesystems/sysfs.rst suggested, show() should only
use sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.
No functional change intended.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/20240801083156.2513508-2-luoxueqin@kylinos.cn
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
help manage different structures in various cgroup subsystems by being
an embedded element inside a larger structure like cpuset or mem_cgroup.
The /proc/cgroups file shows the number of cgroups for each of the
subsystems. With cgroup v1, the number of CSSes is the same as the
number of cgroups. That is not the case anymore with cgroup v2. The
/proc/cgroups file cannot show the actual number of CSSes for the
subsystems that are bound to cgroup v2.
So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
we can't tell by looking at /proc/cgroups which cgroup subsystems may
be responsible.
As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
cgroup.stat file is now being extended to show the number of live and
dying CSSes associated with all the non-inhibited cgroup subsystems that
have been bound to cgroup v2. The number includes CSSes in the current
cgroup as well as in all the descendants underneath it. This will help
us pinpoint which subsystems are responsible for the increasing number
of dying (nr_dying_descendants) cgroups.
The CSSes dying counts are stored in the cgroup structure itself
instead of inside the CSS as suggested by Johannes. This will allow
us to accurately track dying counts of cgroup subsystems that have
recently been disabled in a cgroup. It is now possible that a zero
subsystem number is coupled with a non-zero dying subsystem number.
The cgroup-v2.rst file is updated to discuss this new behavior.
With this patch applied, a sample output from root cgroup.stat file
was shown below.
nr_descendants 56
nr_subsys_cpuset 1
nr_subsys_cpu 43
nr_subsys_io 43
nr_subsys_memory 56
nr_subsys_perf_event 57
nr_subsys_hugetlb 1
nr_subsys_pids 56
nr_subsys_rdma 1
nr_subsys_misc 1
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Another sample output from system.slice/cgroup.stat was:
nr_descendants 34
nr_subsys_cpuset 0
nr_subsys_cpu 32
nr_subsys_io 32
nr_subsys_memory 34
nr_subsys_perf_event 35
nr_subsys_hugetlb 0
nr_subsys_pids 34
nr_subsys_rdma 0
nr_subsys_misc 0
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Note that 'debug' controller wasn't used to provide this information because
the controller is not recommended in productions kernels, also many of them
won't enable CONFIG_CGROUP_DEBUG by default.
Similar information could be retrieved with debuggers like drgn but that's
also not always available (e.g. lockdown) and the additional cost of runtime
tracking here is deemed marginal.
tj: Added Michal's paragraphs on why this is not added the debug controller
to the commit message.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20240715150034.2583772-1-longman@redhat.com
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Per the example of:
!atomic_cmpxchg(&key->enabled, 0, 1)
the inverse was written as:
atomic_cmpxchg(&key->enabled, 1, 0)
except of course, that while !old is only true for old == 0, old is
true for everything except old == 0.
Fix it to read:
atomic_cmpxchg(&key->enabled, 1, 0) == 1
such that only the 1->0 transition returns true and goes on to disable
the keys.
Fixes: 83ab38ef0a0b ("jump_label: Fix concurrency issues in static_key_slow_dec()")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lkml.kernel.org/r/20240731105557.GY33588@noisy.programming.kicks-ass.net
|
|
The recent fix for making the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed as compilers hoist the access into the non-preemptible
region where the pointer is actually used. But of course it's valid that
the compiler keeps it at the place where the code puts it which rightfully
triggers:
BUG: using smp_processor_id() in preemptible [00000000] code:
caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0
Move it to the actual usage site which is in a non-preemptible region.
Fixes: f7d43dd206e7 ("tick/broadcast: Make takeover of broadcast hrtimer reliable")
Reported-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
|
|
The process of constructing scheduling domains
involves multiple loops and repeated evaluations, leading to numerous
redundant and ineffective assessments that impact code efficiency.
Here, we use union-find to optimize the merging of cpumasks. By employing
path compression and union by rank, we effectively reduce the number of
lookups and merge comparisons.
Signed-off-by: Xavier <xavier_qy@163.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There are several functions to decrease attach_in_progress, and they
will wake up cpuset_attach_wq when attach_in_progress is zero. So,
add a helper to make it concise.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Since the SLAB implementation was removed in v6.8, so the
cpuset_slab_spread_rotor is no longer used and can be removed.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently when a process in a group forks and fails due to it's
parent's max restriction, all the cgroups from 'pids_forking' to root
will generate event notifications but only the cgroups from
'pids_over_limit' to root will increase the counter of PIDCG_MAX.
Consider this scenario: there are 4 groups A, B, C,and D, the
relationships are as follows, and user is watching on C.pids.events.
root->A->B->C->D
When a process in D forks and fails due to B.max restriction, the
user will get a spurious event notification because when he wakes up
and reads C.pids.events, he will find that the content has not changed.
To address this issue, only the cgroups from 'pids_over_limit' to root
will have their PIDCG_MAX counters increased and event notifications
generated.
Fixes: 385a635cacfe ("cgroup/pids: Make event counters hierarchical")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The child_ecpus_count variable was previously used to update
sibling cpumask when parent's effective_cpus is updated. However, it became
obsolete after commit e2ffe502ba45 ("cgroup/cpuset: Add
cpuset.cpus.exclusive for v2"). It should be removed.
tj: Restored {} for style consistency.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
For some reason I didn't see this issue on my arm64 or x86-64 builds,
but Stephen Rothwell reports that commit 2accfdb7eff6 ("profiling:
attempt to remove per-cpu profile flip buffer") left these static
variables around, and the powerpc build is unhappy about them:
kernel/profile.c:52:28: warning: 'cpu_profile_flip' defined but not used [-Wunused-variable]
52 | static DEFINE_PER_CPU(int, cpu_profile_flip);
| ^~~~~~~~~~~~~~~~
..
So remove these stale left-over remnants too.
Fixes: 2accfdb7eff6 ("profiling: attempt to remove per-cpu profile flip buffer")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
These really can be handled gracefully without killing the machine.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
The task pointer which is handed to dequeue_signal() is always current. The
argument along with the first comment about signalfd in that function is
confusing at best. Remove it and use current internally.
Update the stale comment for dequeue_signal() while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Rename posix_timer_event() to posix_timer_queue_signal() as this is what
the function is about.
Consolidate the requeue pending and deactivation updates into that function
as there is no point in doing this in all incarnations of posix timers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Posix CPU timers are not updating k_itimer::it_active which makes it
impossible to base decisions in the common posix timer code on it.
Update it when queueing or dequeueing posix CPU timers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
hrtimer based and CPU timers have their own way to install the new interval
and to reset overrun and signal handling related data.
Create a helper function and do the same operation for all variants.
This also makes the handling of the interval consistent. It's only stored
when the timer is actually armed, i.e. timer->it_value != 0. Before that it
was stored unconditionally for posix CPU timers and conditionally for the
other posix timers.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
No requirement for a real list. Spare a few bytes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Keeping the overrun count of the previous setup around is just wrong. The
new setting has nothing to do with the previous one and has to start from a
clean slate.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
No point in doing this all over the place.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Avoid the late sighand lock/unlock dance when a timer is not armed to
enforce reevaluation of the timer base so that the process wide CPU timer
sampling can be disabled.
Do it right at the point where the arming decision is made which already
has sighand locked.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
A leftover from historical code which describes fiction.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
posix_cpu_timer_set() uses @val as variable for the current time. That's
confusing at best.
Use @now as anywhere else and rewrite the confusing comment about clock
sampling.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
There is no point in arming SIGEV_NONE timers as they never deliver a
signal. timer_gettime() is handling the expiry time correctly and that's
all SIGEV_NONE timers care about.
Prevent arming them and remove the expiry handler code which just disarms
them.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Reuse the split out __posix_cpu_timer_get() function which does already the
right thing.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in
timer_get(), but the posix CPU timer implementation returns 1 nsec.
Add the missing conditional.
This will be cleaned up in a follow up patch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Expired SIGEV_NONE oneshot timers must return 0 nsec for the expiry time in
timer_get(), but the posix CPU timer implementation returns 1 nsec.
Add the missing conditional.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
timer_gettime() must return the remaining time to the next expiry of a
timer or 0 if the timer is not armed and no signal pending, but posix CPU
timers fail to forward a timer which is already expired.
Add the required logic to address that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
There is no point to return the interval for timers which have been
disarmed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
In preparation for addressing issues in the timer_get() and timer_set()
functions of posix CPU timers.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
The TWA_NMI_CURRENT handling very much depends on IRQ_WORK, but that
isn't universally enabled everywhere.
Maybe the IRQ_WORK infrastructure should just be unconditional - x86
ends up indirectly enabling it through unconditionally enabling
PERF_EVENTS, for example. But it also gets enabled by having SMP
support, or even if you just have PRINTK enabled.
But in the meantime TWA_NMI_CURRENT causes tons of build failures on
various odd minimal configs. Which did show up in linux-next, but
despite that nobody bothered to fix it or even inform me until -rc1 was
out.
Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: kernelci.org bot <bot@kernelci.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the really old legacy kernel profiling code, which has long
since been obviated by "real profiling" (ie 'prof' and company), and
mainly remains as a source of syzbot reports.
There are anecdotal reports that people still use it for boot-time
profiling, but it's unlikely that such use would care about the old NUMA
optimizations in this code from 2004 (commit ad02973d42: "profile: 512x
Altix timer interrupt livelock fix" in the BK import archive at [1])
So in order to head off future syzbot reports, let's try to simplify
this code and get rid of the per-cpu profile buffers that are quite a
large portion of the complexity footprint of this thing (including CPU
hotplug callbacks etc).
It's unlikely anybody will actually notice, or possibly, as Thomas put
it: "Only people who indulge in nostalgia will notice :)".
That said, if it turns out that this code is actually actively used by
somebody, we can always revert this removal. Thus the "attempt" in the
summary line.
[ Note: in a small nod to "the profiling code can cause NUMA problems",
this also removes the "increment the last entry in the profiling array
on any unknown hits" logic. That would account any program counter in
a module to that single counter location, and might exacerbate any
NUMA cacheline bouncing issues ]
Link: https://lore.kernel.org/all/CAHk-=wgs52BxT4Zjmjz8aNvHWKxf5_ThBY4bYL1Y6CTaNL2dTw@mail.gmail.com/
Link: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git [1]
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
syzbot is reporting uninit-value at profile_hits(), for there is a race
window between
if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
return -ENOMEM;
cpumask_copy(prof_cpu_mask, cpu_possible_mask);
in profile_init() and
cpumask_available(prof_cpu_mask) &&
cpumask_test_cpu(smp_processor_id(), prof_cpu_mask))
in profile_tick(); prof_cpu_mask remains uninitialzed until cpumask_copy()
completes while cpumask_available(prof_cpu_mask) returns true as soon as
alloc_cpumask_var(&prof_cpu_mask) completes.
We could replace alloc_cpumask_var() with zalloc_cpumask_var() and
call cpumask_copy() from create_proc_profile() on only UP kernels, for
profile_online_cpu() calls cpumask_set_cpu() as needed via
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN) on SMP kernels. But this patch
removes prof_cpu_mask because it seems unnecessary.
The cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) test
in profile_tick() is likely always true due to
a CPU cannot call profile_tick() if that CPU is offline
and
cpumask_set_cpu(cpu, prof_cpu_mask) is called when that CPU becomes
online and cpumask_clear_cpu(cpu, prof_cpu_mask) is called when that
CPU becomes offline
. This test could be false during transition between online and offline.
But according to include/linux/cpuhotplug.h , CPUHP_PROFILE_PREPARE
belongs to PREPARE section, which means that the CPU subjected to
profile_dead_cpu() cannot be inside profile_tick() (i.e. no risk of
use-after-free bug) because interrupt for that CPU is disabled during
PREPARE section. Therefore, this test is guaranteed to be true, and
can be removed. (Since profile_hits() checks prof_buffer != NULL, we
don't need to check prof_buffer != NULL here unless get_irq_regs() or
user_mode() is such slow that we want to avoid when prof_buffer == NULL).
do_profile_hits() is called from profile_tick() from timer interrupt
only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and
prof_buffer is not NULL. But syzbot is also reporting that sometimes
do_profile_hits() is called while current thread is still doing vzalloc(),
where prof_buffer must be NULL at this moment. This indicates that multiple
threads concurrently tried to write to /sys/kernel/profiling interface,
which caused that somebody else try to re-allocate prof_buffer despite
somebody has already allocated prof_buffer. Fix this by using
serialization.
Reported-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=b1a83ab2a9eb9321fbdd
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: syzbot <syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If cpuset_cpu_inactive() fails, set_rq_online() need be called to rollback.
Fixes: 120455c514f7 ("sched: Fix hotplug vs CPU bandwidth control")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-5-yangyingliang@huaweicloud.com
|
|
Introduce sched_set_rq_on/offline() helper, so it can be called
in normal or error path simply. No functional changed.
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-4-yangyingliang@huaweicloud.com
|
|
I got the following warn report while doing stress test:
jump label: negative count!
WARNING: CPU: 3 PID: 38 at kernel/jump_label.c:263 static_key_slow_try_dec+0x9d/0xb0
Call Trace:
<TASK>
__static_key_slow_dec_cpuslocked+0x16/0x70
sched_cpu_deactivate+0x26e/0x2a0
cpuhp_invoke_callback+0x3ad/0x10d0
cpuhp_thread_fun+0x3f5/0x680
smpboot_thread_fn+0x56d/0x8d0
kthread+0x309/0x400
ret_from_fork+0x41/0x70
ret_from_fork_asm+0x1b/0x30
</TASK>
Because when cpuset_cpu_inactive() fails in sched_cpu_deactivate(),
the cpu offline failed, but sched_smt_present is decremented before
calling sched_cpu_deactivate(), it leads to unbalanced dec/inc, so
fix it by incrementing sched_smt_present in the error path.
Fixes: c5511d03ec09 ("sched/smt: Make sched_smt_present track topology")
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/r/20240703031610.587047-3-yangyingliang@huaweicloud.com
|
|
Introduce sched_smt_present_inc/dec() helper, so it can be called
in normal or error path simply. No functional changed.
Cc: stable@kernel.org
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240703031610.587047-2-yangyingliang@huaweicloud.com
|
|
In extreme test scenarios:
the 14th field utime in /proc/xx/stat is greater than sum_exec_runtime,
utime = 18446744073709518790 ns, rtime = 135989749728000 ns
In cputime_adjust() process, stime is greater than rtime due to
mul_u64_u64_div_u64() precision problem.
before call mul_u64_u64_div_u64(),
stime = 175136586720000, rtime = 135989749728000, utime = 1416780000.
after call mul_u64_u64_div_u64(),
stime = 135989949653530
unsigned reversion occurs because rtime is less than stime.
utime = rtime - stime = 135989749728000 - 135989949653530
= -199925530
= (u64)18446744073709518790
Trigger condition:
1). User task run in kernel mode most of time
2). ARM64 architecture
3). TICK_CPU_ACCOUNTING=y
CONFIG_VIRT_CPU_ACCOUNTING_NATIVE is not set
Fix mul_u64_u64_div_u64() conversion precision by reset stime to rtime
Fixes: 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()")
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20240726023235.217771-1-zhengzucheng@huawei.com
|
|
"enum vcpu_state" is not compatible with "u8" type for all targets,
resulting in:
error: initialization of 'u8 *' {aka 'unsigned char *'} from incompatible pointer type 'enum vcpu_state *'
for LoongArch. Correct the type of "old" variable to "u8".
Fixes: fea0e1820b51 ("locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.h")
Closes: https://lore.kernel.org/lkml/20240719024010.3296488-1-maobibo@loongson.cn/
Reported-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20240721164552.50175-1-ubizjak@gmail.com
|
|
This just standardizes the use of MIN() and MAX() macros, with the very
traditional semantics. The goal is to use these for C constant
expressions and for top-level / static initializers, and so be able to
simplify the min()/max() macros.
These macro names were used by various kernel code - they are very
traditional, after all - and all such users have been fixed up, with a
few different approaches:
- trivial duplicated macro definitions have been removed
Note that 'trivial' here means that it's obviously kernel code that
already included all the major kernel headers, and thus gets the new
generic MIN/MAX macros automatically.
- non-trivial duplicated macro definitions are guarded with #ifndef
This is the "yes, they define their own versions, but no, the include
situation is not entirely obvious, and maybe they don't get the
generic version automatically" case.
- strange use case #1
A couple of drivers decided that the way they want to describe their
versioning is with
#define MAJ 1
#define MIN 2
#define DRV_VERSION __stringify(MAJ) "." __stringify(MIN)
which adds zero value and I just did my Alexander the Great
impersonation, and rewrote that pointless Gordian knot as
#define DRV_VERSION "1.2"
instead.
- strange use case #2
A couple of drivers thought that it's a good idea to have a random
'MIN' or 'MAX' define for a value or index into a table, rather than
the traditional macro that takes arguments.
These values were re-written as C enum's instead. The new
function-line macros only expand when followed by an open
parenthesis, and thus don't clash with enum use.
Happily, there weren't really all that many of these cases, and a lot of
users already had the pattern of using '#ifndef' guarding (or in one
case just using '#undef MIN') before defining their own private version
that does the same thing. I left such cases alone.
Cc: David Laight <David.Laight@aculab.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer migration updates from Thomas Gleixner:
"Fixes and minor updates for the timer migration code:
- Stop testing the group->parent pointer as it is not guaranteed to
be stable over a chain of operations by design.
This includes a warning which would be nice to have but it produces
false positives due to the racy nature of the check.
- Plug a race between CPUs going in and out of idle and a CPU hotplug
operation. The latter can create and connect a new hierarchy level
which is missed in the concurrent updates of CPUs which go into
idle. As a result the events of such a CPU might not be processed
and timers go stale.
Cure it by splitting the hotplug operation into a prepare and
online callback. The prepare callback is guaranteed to run on an
online and therefore active CPU. This CPU updates the hierarchy and
being online ensures that there is always at least one migrator
active which handles the modified hierarchy correctly when going
idle. The online callback which runs on the incoming CPU then just
marks the CPU active and brings it into operation.
- Improve tracing and polish the code further so it is more obvious
what's going on"
* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix grammar in comment
timers/migration: Spare write when nothing changed
timers/migration: Rename childmask by groupmask to make naming more obvious
timers/migration: Read childmask and parent pointer in a single place
timers/migration: Use a single struct for hierarchy walk data
timers/migration: Improve tracing
timers/migration: Move hierarchy setup into cpuhotplug prepare callback
timers/migration: Do not rely always on group->parent
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
A lot of networking people were at a conference last week, busy
catching COVID, so relatively short PR.
Current release - regressions:
- tcp: process the 3rd ACK with sk_socket for TFO and MPTCP
Current release - new code bugs:
- l2tp: protect session IDR and tunnel session list with one lock,
make sure the state is coherent to avoid a warning
- eth: bnxt_en: update xdp_rxq_info in queue restart logic
- eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field
Previous releases - regressions:
- xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
the field reuses previously un-validated pad
Previous releases - always broken:
- tap/tun: drop short frames to prevent crashes later in the stack
- eth: ice: add a per-VF limit on number of FDIR filters
- af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"
* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
tun: add missing verification for short frame
tap: add missing verification for short frame
mISDN: Fix a use after free in hfcmulti_tx()
gve: Fix an edge case for TSO skb validity check
bnxt_en: update xdp_rxq_info in queue restart logic
tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
bpf: Fix a segment issue when downgrading gso_size
net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
MAINTAINERS: make Breno the netconsole maintainer
MAINTAINERS: Update bonding entry
net: nexthop: Initialize all fields in dumped nexthops
net: stmmac: Correct byte order of perfect_match
selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
tipc: Return non-zero value from tipc_udp_addr2str() on error
netfilter: nft_set_pipapo_avx2: disable softinterrupts
ice: Fix recipe read procedure
ice: Add a per-VF limit on number of FDIR filters
net: bonding: correctly annotate RCU in bond_should_notify_peers()
...
|