Age | Commit message (Collapse) | Author |
|
Move cad_pid as well as supporting function proc_do_cad_pid into
kernel/pic.c. Replaced call to __do_proc_dointvec with proc_dointvec
inside proc_do_cad_pid which requires the copy of the ctl_table to
handle the temp value.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move the ctl_table with the "tainted" proc_name into kernel/panic.c.
With it moves the proc_tainted helper function.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move both sysrq ctl_table and supported sysrq_sysctl_handler helper
function into drivers/tty/sysrq.c. Replaced the __do_proc_dointvec in
helper function with do_proc_dointvec_minmax as the former is local to
kernel/sysctl.c. Here we use the minmax version of do_proc_dointvec
because do_proc_dointvec is static and calling do_proc_dointvec_minmax
with a NULL min and max is the same as calling do_proc_dointvec.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Kees Cook <kees@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
make sysctl_max_threads static as it no longer needs to be exported into
sysctl.c.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move the soft-power ctl table into parisc/power.c. As a consequence the
pwrsw_enabled var is made static.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move the randomize_va_space variable together with all its sysctl table
elements into memory.c. Register it to the "kernel" directory by
adding it to the subsys initialization calls
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move sysctl_panic_on_rcu_stall and sysctl_max_rcu_stall_to_panic into
the kernel/rcu subdirectory. Make these static in tree_stall.h and
removed them as extern from panic.h as their scope is now confined into
one file.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Move the max_lock_depth sysctl table element into rtmutex_api.c. Removed
the rtmutex.h include from sysctl.c. Chose to move into rtmutex_api.c
to avoid multiple registrations every time rtmutex.c is included in other
files.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
subsys
Move module sysctl (modprobe_path and modules_disabled) out of sysctl.c
and into the modules subsystem. Make modules_disabled static as it no
longer needs to be exported. Remove module.h from the includes in sysctl
as it no longer uses any module exported variables.
This is part of a greater effort to move ctl tables into their
respective subsystems which will reduce the merge conflicts in
kernel/sysctl.c.
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
Newer compiler versions rightfully point out:
kernel/kcsan/kcsan_test.c:591:41: error: variable 'dummy' is
uninitialized when passed as a const pointer argument here
[-Werror,-Wuninitialized-const-pointer]
591 | KCSAN_EXPECT_READ_BARRIER(atomic_read(&dummy), false);
| ^~~~~
1 error generated.
Although this particular test does not care about the value stored in
the dummy atomic variable, let's silence the warning.
Link: https://lkml.kernel.org/r/CA+G9fYu8JY=k-r0hnBRSkQQrFJ1Bz+ShdXNwC1TNeMt0eXaxeA@mail.gmail.com
Fixes: 8bc32b348178 ("kcsan: test: Add test cases for memory barrier instrumentation")
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Reviewed-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marco Elver <elver@google.com>
|
|
Fix typo "allocade" -> "allocated".
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250710095628.42ed6b06@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Ftrace is tightly coupled with architecture specific code because it
requires the use of trampolines written in assembly. This means that when
a new feature or optimization is made, it must be done for all
architectures. To simplify the approach, CONFIG_HAVE_FTRACE_* configs are
added to denote which architecture has the new enhancement so that other
architectures can still function until they too have been updated.
The CONFIG_HAVE_FTRACE_MCOUNT was added to help simplify the
DYNAMIC_FTRACE work, but now every architecture that implements
DYNAMIC_FTRACE also has HAVE_FTRACE_MCOUNT set too, making it redundant
with the HAVE_DYNAMIC_FTRACE.
Remove the HAVE_FTRACE_MCOUNT config and use DYNAMIC_FTRACE directly where
applicable.
Link: https://lore.kernel.org/all/20250703154916.48e3ada7@gandalf.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250704104838.27a18690@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When soft disabling of trace events was first created, it needed to have a
way to know if a file had a user that was using it with soft disabled (for
triggers that need to enable or disable events from a context that can not
really enable or disable the event, it would set SOFT_DISABLED to state it
is disabled). The flag SOFT_MODE was used to denote that an event had a
user that would enable or disable it via the SOFT_DISABLED flag.
Commit 1cf4c0732db3c ("tracing: Modify soft-mode only if there's no other
referrer") fixed a bug where if two users were using the SOFT_DISABLED
flag the accounting would get messed up as the SOFT_MODE flag could only
handle one user. That commit added the sm_ref counter which kept track of
how many users were using the event in "soft mode". This made the
SOFT_MODE flag redundant as it should only be set if the sm_ref counter is
non zero.
Remove the SOFT_MODE flag and just use the sm_ref counter to know the
event is in soft mode or not. This makes the code a bit simpler.
Link: https://lore.kernel.org/all/20250702111908.03759998@batman.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Paoloni <gpaoloni@redhat.com>
Link: https://lore.kernel.org/20250702143657.18dd1882@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Memory barriers are useful to ensure memory accesses from one CPU appear in
the original order as seen by other CPUs.
Some smp_rmb() and smp_wmb() are used, but they are not ordering multiple
memory accesses.
Remove them.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/20250626151940.1756398-1-namcao@linutronix.de
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
ftrace has two flavors:
1) static: Where every function always calls the ftrace trampoline
2) dynamic: Where each function has nops that can be changed on demand to
jump to the ftrace trampoline when needed.
The static flavor has very high performance overhead and was only created
to make it easier for architectures to implement the dynamic flavor. An
architecture developer can first implement the static ftrace to make sure
the trampolines work before working on the more complicated dynamic aspect
of ftrace. Once the architecture can support dynamic ftrace, there's no
reason to continue to support the static flavor. In fact, the static
flavor tends to bitrot and bugs start to appear in them.
Remove the prompt to pick DYNAMIC_FTRACE and simply enable it if the
architecture supports it.
Link: https://lore.kernel.org/all/f7e12c6d-892e-4ca3-9ef0-fbb524d04a48@ghiti.fr/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: ChenMiao <chenmiao.ku@gmail.com>
Link: https://lore.kernel.org/20250703115222.2d7c8cd5@batman.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Add a warning if unregister_ftrace_graph() is called without ever
registering it, or if register_ftrace_graph() is called twice. This can
detect errors when they happen and not later when there's a side effect:
Link: https://lore.kernel.org/all/20250617120830.24fbdd62@gandalf.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/20250701194451.22e34724@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When the ring buffer was first introduced, reading the non-consuming
"trace" file required disabling the writing of the ring buffer. To make
sure the writing was fully disabled before iterating the buffer with a
non-consuming read, it would set the disable flag of the buffer and then
call an RCU synchronization to make sure all the buffers were
synchronized.
The function ring_buffer_read_start() originally would initialize the
iterator and call an RCU synchronization, but this was for each individual
per CPU buffer where this would get called many times on a machine with
many CPUs before the trace file could be read. The commit 72c9ddfd4c5bf
("ring-buffer: Make non-consuming read less expensive with lots of cpus.")
separated ring_buffer_read_start into ring_buffer_read_prepare(),
ring_buffer_read_sync() and then ring_buffer_read_start() to allow each of
the per CPU buffers to be prepared, call the read_buffer_read_sync() once,
and then the ring_buffer_read_start() for each of the CPUs which made
things much faster.
The commit 1039221cc278 ("ring-buffer: Do not disable recording when there
is an iterator") removed the requirement of disabling the recording of the
ring buffer in order to iterate it, but it did not remove the
synchronization that was happening that was required to wait for all the
buffers to have no more writers. It's now OK for the buffers to have
writers and no synchronization is needed.
Remove the synchronization and put back the interface for the ring buffer
iterator back before commit 72c9ddfd4c5bf was applied.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250630180440.3eabb514@batman.local.home
Reported-by: David Howells <dhowells@redhat.com>
Fixes: 1039221cc278 ("ring-buffer: Do not disable recording when there is an iterator")
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
It appears that the defect outlined in 9c15eeb5362c4 ("genirq: Allow
fasteoi handler to resend interrupts on concurrent handling") also
affects some other less stellar MSI controllers, this time using
the handle_simple_irq() flow.
Teach this flow about irqd_needs_resend_when_in_progress(). Given
the invasive nature of this workaround, only this flow is updated.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20250708173404.1278635-2-maz@kernel.org
|
|
Yicon reported and Liangyan debugged a live lock in handle_edge_irq()
related to interrupt migration.
If the interrupt affinity is moved to a new target CPU and the interrupt is
currently handled on the previous target CPU for edge type interrupts the
handler might get stuck on the previous target:
CPU 0 (previous target) CPU 1 (new target)
handle_edge_irq()
repeat:
handle_event() handle_edge_irq()
if (INPROGESS) {
set(PENDING);
mask();
return;
}
if (PENDING) {
clear(PENDING);
unmask();
goto repeat;
}
The migration in software never completes and CPU0 continues to handle the
pending events forever. This happens when the device raises interrupts with
a high rate and always before handle_event() completes and before the CPU0
handler can clear INPROGRESS so that CPU1 sets the PENDING flag over and
over. This has been observed in virtual machines.
Prevent this by checking whether the CPU which observes the INPROGRESS flag
is the new affinity target. If that's the case, do not set the PENDING flag
and wait for the INPROGRESS flag to be cleared instead, so that the new
interrupt is handled on the new target CPU and the previous CPU is released
from the action.
This is restricted to the edge type handler and only utilized on systems,
which use single CPU targets for interrupt affinity.
Reported-by: Yicong Shen <shenyicong.1023@bytedance.com>
Reported-by: Liangyan <liangyan.peng@bytedance.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Liangyan <liangyan.peng@bytedance.com>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/all/20250701163558.2588435-1-liangyan.peng@bytedance.com
Link: https://lore.kernel.org/all/20250718185312.076515034@linutronix.de
|
|
Let the calling code check for the IRQD_WAKEUP_ARMED flag to prepare for a
live lock mitigation in the edge type handler.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Liangyan <liangyan.peng@bytedance.com>
Link: https://lore.kernel.org/all/20250718185312.012392426@linutronix.de
|
|
Move it to the call site so that the waiting for the INPROGRESS flag can be
reused by an upcoming mitigation for a potential live lock in the edge type
handler.
No functional change.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Liangyan <liangyan.peng@bytedance.com>
Reviewed-by: Jiri Slaby <jirislaby@kernel.org>
Link: https://lore.kernel.org/all/20250718185311.948555026@linutronix.de
|
|
The variable is only used at one place, which can simply take the constant
as function argument.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Liangyan <liangyan.peng@bytedance.com>
Link: https://lore.kernel.org/all/20250718185311.884314473@linutronix.de
|
|
drivers
Most drivers only populate the fields cycles and cs_id of system_counterval
in their get_time_fn() callback for get_device_system_crosststamp(), unless
they explicitly provide nanosecond values.
When the use_nsecs field was added to struct system_counterval, most
drivers did not care. Clock sources other than CSID_GENERIC could then get
converted in convert_base_to_cs() based on an uninitialized use_nsecs field,
which usually results in -EINVAL during the following range check.
Pass in a fully zero initialized system_counterval_t to cure that.
Fixes: 6b2e29977518 ("timekeeping: Provide infrastructure for converting to/from a base clock")
Signed-off-by: Markus Blöchl <markus@blochl.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250720-timekeeping_uninit_crossts-v2-1-f513c885b7c2@blochl.de
|
|
The synchronization of CPU offlining with GP initialization is confusing
to put it mildly (rightfully so as the issue it deals with is complex).
Recent discussions brought up a question -- what prevents the
rcu_implicit_dyntick_qs() from warning about QS reports for offline
CPUs (missing QS reports for offline CPUs causing indefinite hangs).
QS reporting for now-offline CPUs should only happen from:
- gp_init()
- rcutree_cpu_report_dead()
Add some documentation on this and refer to it from comments in the code
explaining how QS reporting is not missed when these functions are
concurrently running.
I referred heavily to this post [1] about the need for the ofl_lock.
[1] https://lore.kernel.org/all/20180924164443.GF4222@linux.ibm.com/
[ Applied paulmck feedback on moving documentation to Requirements.rst ]
Link: https://lore.kernel.org/all/01b4d228-9416-43f8-a62e-124b92e8741a@paulmck-laptop/
Co-developed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: "Paul E. McKenney" <paulmck@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
|
|
The details of this are subtle and was discussed recently. Add a
quick-quiz about this and refer to it from the code, for more clarity.
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
|
|
Add detailed comments explaining the critical ordering constraints
during RCU grace period initialization, based on discussions with
Frederic.
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Co-developed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.upadhyay@kernel.org>
|
|
To reduce stale data lifetimes, enable CONFIG_INIT_ON_FREE_DEFAULT_ON as
well. This matches the addition of CONFIG_STACKLEAK=y, which is doing
similar for stack memory.
Link: https://lore.kernel.org/r/20250717232519.2984886-13-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Since we can wipe the stack with both Clang and GCC plugins, enable this
for the "hardening.config" for wider testing.
Link: https://lore.kernel.org/r/20250717232519.2984886-12-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
The Clang stack depth tracking implementation has a fixed name for
the stack depth tracking callback, "__sanitizer_cov_stack_depth", so
rename the GCC plugin function to match since the plugin has no external
dependencies on naming.
Link: https://lore.kernel.org/r/20250717232519.2984886-2-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
In preparation for adding Clang sanitizer coverage stack depth tracking
that can support stack depth callbacks:
- Add the new top-level CONFIG_KSTACK_ERASE option which will be
implemented either with the stackleak GCC plugin, or with the Clang
stack depth callback support.
- Rename CONFIG_GCC_PLUGIN_STACKLEAK as needed to CONFIG_KSTACK_ERASE,
but keep it for anything specific to the GCC plugin itself.
- Rename all exposed "STACKLEAK" names and files to "KSTACK_ERASE" (named
for what it does rather than what it protects against), but leave as
many of the internals alone as possible to avoid even more churn.
While here, also split "prev_lowest_stack" into CONFIG_KSTACK_ERASE_METRICS,
since that's the only place it is referenced from.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250717232519.2984886-1-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Intel linux test robot reported a warning that ERR_CAST can be used
for error pointer casting instead of more-complicated/rarely-used
ERR_PTR(PTR_ERR(...)) style.
There is no functionality change, but still let us replace two such
instances as it improves consistency and readability.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507201048.bceHy8zX-lkp@intel.com/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20250720164754.3999140-1-yonghong.song@linux.dev
|
|
As the trace event powernv_throttle is only used by the powernv code, move
it to a separate include file and have that code directly enable it.
Trace events can take up around 5K of memory when they are defined
regardless if they are used or not. It wastes memory to have them defined
in configurations where the tracepoint is not used.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/20250612145407.906308844@goodmis.org
Fixes: 0306e481d479a ("cpufreq: powernv/tracing: Add powernv_throttle tracepoint")
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
All architectures have an interruptible RCU extended quiescent state
(EQS) as part of their idle sequences, where interrupts can occur
without RCU watching. Entry code must account for this and wake RCU as
necessary; the common entry code deals with this in irqentry_enter() by
treating any interrupt from an idle thread as potentially having
occurred within an EQS and waking RCU for the duration of the interrupt
via rcu_irq_enter() .. rcu_irq_exit().
Some architectures may have other interruptible EQSs which require
similar treatment. For example, on s390 it is necessary to enable
interrupts around guest entry in the middle of a period where core KVM
code has entered an EQS.
So that architectures can wake RCU in these cases, this patch adds a
new arch_in_rcu_eqs() hook to the common entry code which is checked in
addition to the existing is_idle_thread() check, with RCU woken if
either returns true. A default implementation is provided which always
returns false, which suffices for most architectures.
As no architectures currently implement arch_in_rcu_eqs(), there should
be no functional change as a result of this patch alone. A subsequent
patch will add an s390 implementation to fix a latent bug with missing
RCU wakeups.
[ajd@linux.ibm.com: rebase, fix commit message]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20250708092742.104309-2-ajd@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Message-ID: <20250708092742.104309-2-ajd@linux.ibm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix timerlat with use of FORTIFY_SOURCE
FORTIFY_SOURCE was added to the stack tracer where it compares the
entry->caller array to having entry->size elements.
timerlat has the following:
memcpy(&entry->caller, fstack->calls, size);
entry->size = size;
Which triggers FORTIFY_SOURCE as the caller is populated before the
entry->size is initialized.
Swap the order to satisfy FORTIFY_SOURCE logic.
- Add down_write(trace_event_sem) when adding trace events in modules
Trace events being added to the ftrace_events array are protected by
the trace_event_sem semaphore. But when loading modules that have
trace events, the addition of the events are not protected by the
semaphore and loading two modules that have events at the same time
can corrupt the list.
Also add a lockdep_assert_held(trace_event_sem) to
_trace_add_event_dirs() to confirm it is held when iterating the
list.
* tag 'trace-v6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Add down_write(trace_event_sem) when adding trace event
tracing/osnoise: Fix crash in timerlat_dump_stack()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single fix for the scheduler.
A recent commit changed the runqueue counter nr_uninterruptible to an
unsigned int. Due to the fact that the counters are not updated on
migration of a uninterruptble task to a different CPU, these counters
can exceed INT_MAX.
The counter is cast to long in the load average calculation, which
means that the cast expands into negative space resulting in bogus
load average values.
Convert it back to unsigned long to fix this.
* tag 'sched-urgent-2025-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Change nr_uninterruptible type to unsigned long
|
|
Inspired by mutex blocker tracking[1], and having already extended it to
semaphores, let's now add support for reader-writer semaphores (rwsems).
The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while
waiting for an rwsem, we just call hung_task_set_blocker(). The hung task
detector can then query the rwsem's owner to identify the lock holder.
Tracking works reliably for writers, as there can only be a single writer
holding the lock, and its task struct is stored in the owner field.
The main challenge lies with readers. The owner field points to only one
of many concurrent readers, so we might lose track of the blocker if that
specific reader unlocks, even while others remain. This is not a
significant issue, however. In practice, long-lasting lock contention is
almost always caused by a writer. Therefore, reliably tracking the writer
is the primary goal of this patch series ;)
With this change, the hung task detector can now show blocker task's info
like below:
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds.
[Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8
[Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340
[Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0
[Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10
[Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10
[Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700
[Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710
[Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer>
[Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140
[Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[1] https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/
Link: https://lkml.kernel.org/r/20250627072924.36567-3-lance.yang@linux.dev
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Mingzhe Yang <mingzhe.yang@ly.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yongliang Gao <leonylgao@tencent.com>
Cc: Zi Li <zi.li@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "extend hung task blocker tracking to rwsems".
Inspired by mutex blocker tracking[1], and having already extended it to
semaphores, let's now add support for reader-writer semaphores (rwsems).
The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while
waiting for an rwsem, we just call hung_task_set_blocker(). The hung task
detector can then query the rwsem's owner to identify the lock holder.
Tracking works reliably for writers, as there can only be a single writer
holding the lock, and its task struct is stored in the owner field.
The main challenge lies with readers. The owner field points to only one
of many concurrent readers, so we might lose track of the blocker if that
specific reader unlocks, even while others remain. This is not a
significant issue, however. In practice, long-lasting lock contention is
almost always caused by a writer. Therefore, reliably tracking the writer
is the primary goal of this patch series ;)
With this change, the hung task detector can now show blocker task's info
like below:
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds.
[Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8
[Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340
[Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0
[Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10
[Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10
[Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700
[Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710
[Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer>
[Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140
[Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
This patch (of 3):
In preparation for extending blocker tracking to support rwsems, make the
rwsem_owner() and is_rwsem_reader_owned() helpers globally available for
determining if the blocker is a writer or one of the readers.
Additionally, a stale owner pointer in a reader-owned rwsem can lead to
false positives in blocker tracking when CONFIG_DETECT_HUNG_TASK_BLOCKER
is enabled. To mitigate this, clear the owner field on the reader unlock
path, similar to what CONFIG_DEBUG_RWSEMS does. A NULL owner is better
than a stale one for diagnostics.
Link: https://lkml.kernel.org/r/20250627072924.36567-1-lance.yang@linux.dev
Link: https://lkml.kernel.org/r/20250627072924.36567-2-lance.yang@linux.dev
Link: https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/ [1]
Signed-off-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Anna Schumaker <anna.schumaker@oracle.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: John Stultz <jstultz@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Mingzhe Yang <mingzhe.yang@ly.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tomasz Figa <tfiga@chromium.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yongliang Gao <leonylgao@tencent.com>
Cc: Zi Li <zi.li@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add a dedicated core parameter 'panic_console_replay' for controlling
console replay, and add note that 'panic_print' sysctl interface will be
obsoleted by 'panic_sys_info' and 'panic_console_replay'. When it
happens, the SYS_INFO_PANIC_CONSOLE_REPLAY can be removed as well.
Link: https://lkml.kernel.org/r/20250703021004.42328-6-feng.tang@linux.alibaba.com
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
'panic_sys_info=' sysctl interface is already added for runtime setting.
Add counterpart kernel cmdline option for boottime setting.
Link: https://lkml.kernel.org/r/20250703021004.42328-5-feng.tang@linux.alibaba.com
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Bitmap definition for 'panic_print' is hard to remember and decode. Add
'panic_sys_info='sysctl to take human readable string like
"tasks,mem,timers,locks,ftrace,..." and translate it into bitmap.
The detailed mapping is:
SYS_INFO_TASKS "tasks"
SYS_INFO_MEM "mem"
SYS_INFO_TIMERS "timers"
SYS_INFO_LOCKS "locks"
SYS_INFO_FTRACE "ftrace"
SYS_INFO_ALL_CPU_BT "all_bt"
SYS_INFO_BLOCKED_TASKS "blocked_tasks"
[nathan@kernel.org: add __maybe_unused to sys_info_avail]
Link: https://lkml.kernel.org/r/20250708-fix-clang-sys_info_avail-warning-v1-1-60d239eacd64@kernel.org
Link: https://lkml.kernel.org/r/20250703021004.42328-4-feng.tang@linux.alibaba.com
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
'panic_print' was introduced to help debugging kernel panic by dumping
different kinds of system information like tasks' call stack, memory,
ftrace buffer, etc. Actually this function could also be used to help
debugging other cases like task-hung, soft/hard lockup, etc. where user
may need the snapshot of system info at that time.
Extract system info dump function related code from panic.c to separate
file sys_info.[ch], for wider usage by other kernel parts for debugging.
Also modify the macro names about singulars/plurals.
Link: https://lkml.kernel.org/r/20250703021004.42328-3-feng.tang@linux.alibaba.com
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "generalize panic_print's dump function to be used by other
kernel parts", v3.
When working on kernel stability issues, panic, task-hung and
software/hardware lockup are frequently met. And to debug them, user may
need lots of system information at that time, like task call stacks, lock
info, memory info etc.
panic case already has panic_print_sys_info() for this purpose, and has a
'panic_print' bitmask to control what kinds of information is needed,
which is also helpful to debug other task-hung and lockup cases.
So this patchset extracts the function out to a new file 'lib/sys_info.c',
and makes it available for other cases which also need to dump system info
for debugging.
Also as suggested by Petr Mladek, add 'panic_sys_info=' interface to take
human readable string like "tasks,mem,locks,timers,ftrace,....", and
eventually obsolete the current 'panic_print' bitmap interface.
In RFC and V1 version, hung_task and SW/HW watchdog modules are enabled
with the new sys_info dump interface. In v2, they are kept out for better
review of current change, and will be posted later.
Locally these have been used in our bug chasing for stability issues and
was proven helpful.
Many thanks to Petr Mladek for great suggestions on both the code and
architectures!
This patch (of 5):
Currently the panic_print_sys_info() was called twice with different
parameters to handle console replay case, which is kind of confusing.
Add panic_console_replay() explicitly and rename
'PANIC_PRINT_ALL_PRINTK_MSG' to 'PANIC_CONSOLE_REPLAY', to make the code
straightforward. The related kernel document is also updated.
Link: https://lkml.kernel.org/r/20250703021004.42328-1-feng.tang@linux.alibaba.com
Link: https://lkml.kernel.org/r/20250703021004.42328-2-feng.tang@linux.alibaba.com
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When re-using the CMA area for kdump there is a risk of pending DMA into
pinned user pages in the CMA area.
Pages residing in CMA areas can usually not get long-term pinned and are
instead migrated away from the CMA area, so long-term pinning is typically
not a concern. (BUGs in the kernel might still lead to long-term pinning
of such pages if everything goes wrong.)
Pages pinned without FOLL_LONGTERM remain in the CMA and may possibly be
the source or destination of a pending DMA transfer.
Although there is no clear specification how long a page may be pinned
without FOLL_LONGTERM, pinning without the flag shows an intent of the
caller to only use the memory for short-lived DMA transfers, not a
transfer initiated by a device asynchronously at a random time in the
future.
Add a delay of CMA_DMA_TIMEOUT_SEC seconds before starting the kdump
kernel, giving such short-lived DMA transfers time to finish before the
CMA memory is re-used by the kdump kernel.
Set CMA_DMA_TIMEOUT_SEC to 10 seconds - chosen arbitrarily as both a huge
margin for a DMA transfer, yet not increasing the kdump time too
significantly.
Link: https://lkml.kernel.org/r/aEqpgDIBndZ5LXSo@dwarf.suse.cz
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Donald Dutile <ddutile@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Tao Liu <ltao@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
reserve_crashkernel_cma() reserves CMA ranges for the crash kernel. If
allocating the requested size fails, try to reserve in smaller blocks.
Store the reserved ranges in the crashk_cma_ranges array and the number of
ranges in crashk_cma_cnt.
Link: https://lkml.kernel.org/r/aEqpBwOy_ekm0gw9@dwarf.suse.cz
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Donald Dutile <ddutile@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Tao Liu <ltao@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "kdump: crashkernel reservation from CMA", v5.
This series implements a way to reserve additional crash kernel memory
using CMA.
Currently, all the memory for the crash kernel is not usable by the 1st
(production) kernel. It is also unmapped so that it can't be corrupted by
the fault that will eventually trigger the crash. This makes sense for
the memory actually used by the kexec-loaded crash kernel image and initrd
and the data prepared during the load (vmcoreinfo, ...). However, the
reserved space needs to be much larger than that to provide enough
run-time memory for the crash kernel and the kdump userspace. Estimating
the amount of memory to reserve is difficult. Being too careful makes
kdump likely to end in OOM, being too generous takes even more memory from
the production system. Also, the reservation only allows reserving a
single contiguous block (or two with the "low" suffix). I've seen systems
where this fails because the physical memory is fragmented.
By reserving additional crashkernel memory from CMA, the main crashkernel
reservation can be just large enough to fit the kernel and initrd image,
minimizing the memory taken away from the production system. Most of the
run-time memory for the crash kernel will be memory previously available
to userspace in the production system. As this memory is no longer
wasted, the reservation can be done with a generous margin, making kdump
more reliable. Kernel memory that we need to preserve for dumping is
normally not allocated from CMA, unless it is explicitly allocated as
movable. Currently this is only the case for memory ballooning and zswap.
Such movable memory will be missing from the vmcore. User data is
typically not dumped by makedumpfile. When dumping of user data is
intended this new CMA reservation cannot be used.
There are five patches in this series:
The first adds a new ",cma" suffix to the recenly introduced generic
crashkernel parsing code. parse_crashkernel() takes one more argument to
store the cma reservation size.
The second patch implements reserve_crashkernel_cma() which performs the
reservation. If the requested size is not available in a single range,
multiple smaller ranges will be reserved.
The third patch updates Documentation/, explicitly mentioning the
potential DMA corruption of the CMA-reserved memory.
The fourth patch adds a short delay before booting the kdump kernel,
allowing pending DMA transfers to finish.
The fifth patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
through /proc/vmcore by excluding them from the vmcoreinfo
PT_LOAD ranges.
Adding other architectures is easy and I can do that as soon as this
series is merged.
With this series applied, specifying
crashkernel=100M craskhernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.
An additional 1G will be reserved from CMA, still usable by the production
system. The crash kernel will have 1.1G memory available. The 100M can
be reliably predicted based on the size of the kernel and initrd.
The new cma suffix is completely optional. When no
crashkernel=size,cma is specified, everything works as before.
This patch (of 5):
Add a new cma_size parameter to parse_crashkernel(). When not NULL, call
__parse_crashkernel to parse the CMA reservation size from
"crashkernel=size,cma" and store it in cma_size.
Set cma_size to NULL in all calls to parse_crashkernel().
Link: https://lkml.kernel.org/r/aEqnxxfLZMllMC8I@dwarf.suse.cz
Link: https://lkml.kernel.org/r/aEqoQckgoTQNULnh@dwarf.suse.cz
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Donald Dutile <ddutile@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Pingfan Liu <piliu@redhat.com>
Cc: Tao Liu <ltao@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When a module is loaded, it adds trace events defined by the module. It
may also need to modify the modules trace printk formats to replace enum
names with their values.
If two modules are loaded at the same time, the adding of the event to the
ftrace_events list can corrupt the walking of the list in the code that is
modifying the printk format strings and crash the kernel.
The addition of the event should take the trace_event_sem for write while
it adds the new event.
Also add a lockdep_assert_held() on that semaphore in
__trace_add_event_dirs() as it iterates the list.
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://lore.kernel.org/20250718223158.799bfc0c@batman.local.home
Reported-by: Fusheng Huang(黄富生) <Fusheng.Huang@luxshare-ict.com>
Closes: https://lore.kernel.org/all/20250717105007.46ccd18f@batman.local.home/
Fixes: 110bf2b764eb6 ("tracing: add protection around module events unload")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix handling of migration disabled tasks in default idle selection
- update_locked_rq() called __this_cpu_write() spuriously with NULL
when @rq was not locked. As the writes were spurious, it didn't break
anything directly. However, the function could be called in a
preemptible leading to a context warning in __this_cpu_write(). Skip
the spurious NULL writes.
- Selftest fix on UP
* tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: idle: Handle migration-disabled tasks in idle selection
sched/ext: Prevent update_locked_rq() calls with NULL rq
selftests/sched_ext: Fix exit selftest hang on UP
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"An earlier commit to suppress a warning introduced a race condition
where tasks can escape cgroup1 freezer. Revert the commit and simply
remove the warning which was spurious to begin with"
* tag 'cgroup-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Revert "cgroup_freezer: cgroup_freezing: Check if not frozen"
sched,freezer: Remove unnecessary warning in __thaw_task
|
|
/proc/cgroups lists only v1 controllers by default, however, this is
only enforced since the commit af000ce85293b ("cgroup: Do not report
unavailable v1 controllers in /proc/cgroups") and there is software in
the wild that uses content of /proc/cgroups to decide on availability of
v2 (sic) controllers.
Add a boottime param that can bring back the previous behavior for
setups where the check in the software cannot be changed and it causes
e.g. unintended OOMs.
Also, this patch takes out cgrp_v1_visible from cgroup1_subsys_absent()
guard since it's only important to check which hierarchy (v1 vs v2) the
subsys is attached to. This has no effect on the printed message but
the code is cleaner since cgrp_v1_visible is really about mounted
hierarchies, not the content of /proc/cgroups.
Link: https://lore.kernel.org/r/b26b60b7d0d2a5ecfd2f3c45f95f32922ed24686.camel@decadent.org.uk
Fixes: af000ce85293b ("cgroup: Do not report unavailable v1 controllers in /proc/cgroups")
Fixes: a0ab1453226d8 ("cgroup: Print message when /proc/cgroups is read on v2-only system")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
We have observed kernel panics when using timerlat with stack saving,
with the following dmesg output:
memcpy: detected buffer overflow: 88 byte write of buffer size 0
WARNING: CPU: 2 PID: 8153 at lib/string_helpers.c:1032 __fortify_report+0x55/0xa0
CPU: 2 UID: 0 PID: 8153 Comm: timerlatu/2 Kdump: loaded Not tainted 6.15.3-200.fc42.x86_64 #1 PREEMPT(lazy)
Call Trace:
<TASK>
? trace_buffer_lock_reserve+0x2a/0x60
__fortify_panic+0xd/0xf
__timerlat_dump_stack.cold+0xd/0xd
timerlat_dump_stack.part.0+0x47/0x80
timerlat_fd_read+0x36d/0x390
vfs_read+0xe2/0x390
? syscall_exit_to_user_mode+0x1d5/0x210
ksys_read+0x73/0xe0
do_syscall_64+0x7b/0x160
? exc_page_fault+0x7e/0x1a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
__timerlat_dump_stack() constructs the ftrace stack entry like this:
struct stack_entry *entry;
...
memcpy(&entry->caller, fstack->calls, size);
entry->size = fstack->nr_entries;
Since commit e7186af7fb26 ("tracing: Add back FORTIFY_SOURCE logic to
kernel_stack event structure"), struct stack_entry marks its caller
field with __counted_by(size). At the time of the memcpy, entry->size
contains garbage from the ringbuffer, which under some circumstances is
zero, triggering a kernel panic by buffer overflow.
Populate the size field before the memcpy so that the out-of-bounds
check knows the correct size. This is analogous to
__ftrace_trace_stack().
Cc: stable@vger.kernel.org
Cc: John Kacur <jkacur@redhat.com>
Cc: Luis Goncalves <lgoncalv@redhat.com>
Cc: Attila Fazekas <afazekas@redhat.com>
Link: https://lore.kernel.org/20250716143601.7313-1-tglozar@redhat.com
Fixes: e7186af7fb26 ("tracing: Add back FORTIFY_SOURCE logic to kernel_stack event structure")
Signed-off-by: Tomas Glozar <tglozar@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|