Age | Commit message (Collapse) | Author |
|
This is the location regular userspace expects this definition.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250515-nolibc-sys-v1-3-74f82eea3b59@weissschuh.net
|
|
This is the location regular userspace expects this definition.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250515-nolibc-sys-v1-2-74f82eea3b59@weissschuh.net
|
|
This is the location regular userspace expects this definition.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250515-nolibc-sys-v1-1-74f82eea3b59@weissschuh.net
|
|
Newer architectures like riscv 32-bit are missing sys_wait4().
Make use of the fact that wait(&status) is defined to be equivalent to
waitpid(-1, status, 0) to implement it on all architectures.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-15-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
Newer architectures (like riscv32) do not implement sys_gettimeofday().
In those cases fall back to sys_clock_gettime().
While that does not support the timezone argument of sys_gettimeofday(),
specifying this argument invokes undefined behaviour, so it's safe to ignore.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-14-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Only the standard POSIX modes are supported.
No extensions nor the (noop) "b" from ISO C are accepted.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-13-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Not all configurations support namespaces, so skip the tests where
necessary. Also if the tests are running without privileges.
Enable the namespace configuration for those architectures where it is not
enabled by default.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-12-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-11-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-10-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-9-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-8-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-7-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-6-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-5-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-4-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
Add fstat(), fstatat() and lstat(). All of them use the existing implementation
based on statx().
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-3-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
The %m format can be used to format the current errno.
It is non-standard but supported by other commonly used libcs like glibc and
musl, so applications do rely on them.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-2-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is used in various selftests and will be handy when integrating
those with nolibc.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250428-nolibc-misc-v2-1-3c043eeab06c@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
The UAPI headers already provide definitions for these symbols.
Using them makes the code shorter, more robust and compatible with
applications using linux/poll.h directly.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250430-poll-v1-2-44b5ceabdeee@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
This is the location regular userspace expects the definition.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250430-poll-v1-1-44b5ceabdeee@linutronix.de
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
Add nolibc support for m68k. Should be helpful for nommu where
linking libc can bloat even hello world to the point where you get
an OOM just trying to load it.
Signed-off-by: Daniel Palmer <daniel@thingy.jp>
Link: https://lore.kernel.org/r/20250426224738.284874-1-daniel@0x0f.com
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
|
Prevent regressions of issues validates by the header check by always
running it together with the nolibc selftests.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250424-nolibc-header-check-v1-3-011576b6ed6f@linutronix.de
|
|
Inclusion of any nolibc header file should also bring all other headers.
On the other hand it should also be possible to include any nolibc header
files
in any order.
Currently this is implemented by including the catch-all nolibc.h after the
headers own definitions.
This is problematic if one nolibc header depends on another one.
The first header has to include the other one before defining any symbols.
That in turn will include the rest of nolibc while the current header has
not defined anything yet. If any other part of nolibc depends on
definitions from the current header, errors are encountered.
This is already the case today. Effectively nolibc can only be included in
the order of nolibc.h.
Restructure the way "nolibc.h" is included.
Move it to the beginning of the header files and before the include guards.
Now any header will behave exactly like "nolibc.h" while the include
guards prevent any duplicate definitions.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250424-nolibc-header-check-v1-2-011576b6ed6f@linutronix.de
|
|
Each nolibc header should be valid for inclusion irrespective of any
special ordering requirements.
Add a new make target, based on the old kbuild "make header_check" target
to validate this requirement.
For now the check fails, but the following commits will fix the issues.
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Link: https://lore.kernel.org/r/20250424-nolibc-header-check-v1-1-011576b6ed6f@linutronix.de
|
|
Leaving the CQ critical section in the middle of a overflow flushing
can cause cqe reordering since the cache cq pointers are reset and any
new cqe emitters that might get called in between are not going to be
forced into io_cqe_cache_refill().
Fixes: eac2ca2d682f9 ("io_uring: check if we need to reschedule during overflow flush")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/90ba817f1a458f091f355f407de1c911d2b93bbf.1747483784.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin
io-policy") introduced the creation of the multipath sysfs group under
the NVMe head gendisk device node. However, it also inadvertently added
the same sysfs group under each namespace path device which head node
refers to and that is incorrect.
The multipath sysfs group should only be exposed through the namespace
head gendisk node. This is sufficient, as the head device already
provides symbolic links to the individual namespace paths it manages.
This patch fixes the issue by preventing the creation of the multipath
sysfs group under namespace path devices, ensuring it only appears under
the head disk node.
Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Christian Brauner <brauner@kernel.org> says:
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd.
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.
In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors are
closed. That specifically includes 1 (stdout) and 2 (stderr). This has
already caused bugs because userspace assumed that this cannot happen
(Whether or not this is a sane assumption is irrelevant.).
- systemd-coredump will be spawned as a child of system_unbound_wq. So
it is not a child of any userspace process and specifically not a
child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly.
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe.
- A new usermode helper has to be spawned for each crashing process.
This series adds a new mode:
(3) Dumping into an AF_UNIX socket.
Userspace can set /proc/sys/kernel/core_pattern to:
@/path/to/coredump.socket
The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.
The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.
- The coredump server should use SO_PEERPIDFD to get a stable handle on
the connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump.
- By setting core_pipe_limit non-zero userspace can guarantee that the
crashing task cannot be reaped behind it's back and thus process all
necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
detect whether /proc/<pid> still refers to the same process.
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX socket directly.
- The pidfd for the crashing task will contain information how the task
coredumps. The PIDFD_GET_INFO ioctl gained a new flag
PIDFD_INFO_COREDUMP which can be used to retreive the coredump
information.
If the coredump gets a new coredump client connection the kernel
guarantees that PIDFD_INFO_COREDUMP information is available.
Currently the following information is provided in the new
@coredump_mask extension to struct pidfd_info:
* PIDFD_COREDUMPED is raised if the task did actually coredump.
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
undumpable).
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server.
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
treated as sensitive and the coredump server should restrict access
to the generated coredump to sufficiently privileged users.
- The coredump server should mark itself as non-dumpable.
- A container coredump server in a separate network namespace can simply
bind to another well-know address and systemd-coredump fowards
coredumps to the container.
- Coredumps could in the future also be handled via per-user/session
coredump servers that run only with that users privileges.
The coredump server listens on the coredump socket and accepts a
new coredump connection. It then retrieves SO_PEERPIDFD for the
client, inspects uid/gid and hands the accepted client to the users
own coredump handler which runs with the users privileges only
(It must of coure pay close attention to not forward crashing suid
binaries.).
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers.
This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can just keep a worker pool.
* patches from https://lore.kernel.org/20250516-work-coredump-socket-v8-0-664f3caf2516@kernel.org:
selftests/coredump: add tests for AF_UNIX coredumps
selftests/pidfd: add PIDFD_INFO_COREDUMP infrastructure
coredump: validate socket name as it is written
coredump: show supported coredump modes
pidfs, coredump: add PIDFD_INFO_COREDUMP
coredump: add coredump socket
coredump: reflow dump helpers a little
coredump: massage do_coredump()
coredump: massage format_corename()
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-0-664f3caf2516@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add a simple test for generating coredumps via AF_UNIX sockets.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-9-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add PIDFD_INFO_COREDUMP infrastructure so we can use it in tests.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-8-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In contrast to other parameters written into
/proc/sys/kernel/core_pattern that never fail we can validate enabling
the new AF_UNIX support. This is obviously racy as hell but it's always
been that way.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-7-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Allow userspace to discover what coredump modes are supported.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-6-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Extend the PIDFD_INFO_COREDUMP ioctl() with the new PIDFD_INFO_COREDUMP
mask flag. This adds the @coredump_mask field to struct pidfd_info.
When a task coredumps the kernel will provide the following information
to userspace in @coredump_mask:
* PIDFD_COREDUMPED is raised if the task did actually coredump.
* PIDFD_COREDUMP_SKIP is raised if the task skipped coredumping (e.g.,
undumpable).
* PIDFD_COREDUMP_USER is raised if this is a regular coredump and
doesn't need special care by the coredump server.
* PIDFD_COREDUMP_ROOT is raised if the generated coredump should be
treated as sensitive and the coredump server should restrict to the
generated coredump to sufficiently privileged users.
The kernel guarantees that by the time the connection is made the all
PIDFD_INFO_COREDUMP info is available.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-5-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Coredumping currently supports two modes:
(1) Dumping directly into a file somewhere on the filesystem.
(2) Dumping into a pipe connected to a usermode helper process
spawned as a child of the system_unbound_wq or kthreadd.
For simplicity I'm mostly ignoring (1). There's probably still some
users of (1) out there but processing coredumps in this way can be
considered adventurous especially in the face of set*id binaries.
The most common option should be (2) by now. It works by allowing
userspace to put a string into /proc/sys/kernel/core_pattern like:
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
The "|" at the beginning indicates to the kernel that a pipe must be
used. The path following the pipe indicator is a path to a binary that
will be spawned as a usermode helper process. Any additional parameters
pass information about the task that is generating the coredump to the
binary that processes the coredump.
In the example core_pattern shown above systemd-coredump is spawned as a
usermode helper. There's various conceptual consequences of this
(non-exhaustive list):
- systemd-coredump is spawned with file descriptor number 0 (stdin)
connected to the read-end of the pipe. All other file descriptors are
closed. That specifically includes 1 (stdout) and 2 (stderr). This has
already caused bugs because userspace assumed that this cannot happen
(Whether or not this is a sane assumption is irrelevant.).
- systemd-coredump will be spawned as a child of system_unbound_wq. So
it is not a child of any userspace process and specifically not a
child of PID 1. It cannot be waited upon and is in a weird hybrid
upcall which are difficult for userspace to control correctly.
- systemd-coredump is spawned with full kernel privileges. This
necessitates all kinds of weird privilege dropping excercises in
userspace to make this safe.
- A new usermode helper has to be spawned for each crashing process.
This series adds a new mode:
(3) Dumping into an AF_UNIX socket.
Userspace can set /proc/sys/kernel/core_pattern to:
@/path/to/coredump.socket
The "@" at the beginning indicates to the kernel that an AF_UNIX
coredump socket will be used to process coredumps.
The coredump socket must be located in the initial mount namespace.
When a task coredumps it opens a client socket in the initial network
namespace and connects to the coredump socket.
- The coredump server uses SO_PEERPIDFD to get a stable handle on the
connected crashing task. The retrieved pidfd will provide a stable
reference even if the crashing task gets SIGKILLed while generating
the coredump.
- By setting core_pipe_limit non-zero userspace can guarantee that the
crashing task cannot be reaped behind it's back and thus process all
necessary information in /proc/<pid>. The SO_PEERPIDFD can be used to
detect whether /proc/<pid> still refers to the same process.
The core_pipe_limit isn't used to rate-limit connections to the
socket. This can simply be done via AF_UNIX sockets directly.
- The pidfd for the crashing task will grow new information how the task
coredumps.
- The coredump server should mark itself as non-dumpable.
- A container coredump server in a separate network namespace can simply
bind to another well-know address and systemd-coredump fowards
coredumps to the container.
- Coredumps could in the future also be handled via per-user/session
coredump servers that run only with that users privileges.
The coredump server listens on the coredump socket and accepts a
new coredump connection. It then retrieves SO_PEERPIDFD for the
client, inspects uid/gid and hands the accepted client to the users
own coredump handler which runs with the users privileges only
(It must of coure pay close attention to not forward crashing suid
binaries.).
The new coredump socket will allow userspace to not have to rely on
usermode helpers for processing coredumps and provides a safer way to
handle them instead of relying on super privileged coredumping helpers
that have and continue to cause significant CVEs.
This will also be significantly more lightweight since no fork()+exec()
for the usermodehelper is required for each crashing process. The
coredump server in userspace can e.g., just keep a worker pool.
Link: https://lore.kernel.org/20250516-work-coredump-socket-v8-4-664f3caf2516@kernel.org
Acked-by: Luca Boccassi <luca.boccassi@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-17-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
Link: https://lore.kernel.org/r/20250520181644.2673067-16-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-15-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-14-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Guo Ren <guoren@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-13-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-12-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-11-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-10-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Leo Yan <leo.yan@arm.com>
Link: https://lore.kernel.org/r/20250520181644.2673067-9-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Link: https://lore.kernel.org/r/20250520181644.2673067-8-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-7-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-6-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com>
Link: https://lore.kernel.org/r/20250520181644.2673067-5-kan.liang@linux.intel.com
|
|
The throttle support has been added in the generic code. Remove
the driver-specific throttle support.
Besides the throttle, perf_event_overflow may return true because of
event_limit. It already does an inatomic event disable. The pmu->stop
is not required either.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-4-kan.liang@linux.intel.com
|
|
The PERF_RECORD_THROTTLE records are dumped for all throttled events.
It's not necessary for group events, which are throttled altogether.
Optimize it by only dump the throttle log for the leader.
The sample right after the THROTTLE record must be generated by the
actual target event. It is good enough for the perf tool to locate the
actual target event.
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-3-kan.liang@linux.intel.com
|
|
The current throttle logic doesn't work well with a group, e.g., the
following sampling-read case.
$ perf record -e "{cycles,cycles}:S" ...
$ perf report -D | grep THROTTLE | tail -2
THROTTLE events: 426 ( 9.0%)
UNTHROTTLE events: 425 ( 9.0%)
$ perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 1020120874009167 0x74970 [0x68]: PERF_RECORD_SAMPLE(IP, 0x1):
... sample_read:
.... group nr 2
..... id 0000000000000327, value 000000000cbb993a, lost 0
..... id 0000000000000328, value 00000002211c26df, lost 0
The second cycles event has a much larger value than the first cycles
event in the same group.
The current throttle logic in the generic code only logs the THROTTLE
event. It relies on the specific driver implementation to disable
events. For all ARCHs, the implementation is similar. Only the event is
disabled, rather than the group.
The logic to disable the group should be generic for all ARCHs. Add the
logic in the generic code. The following patch will remove the buggy
driver-specific implementation.
The throttle only happens when an event is overflowed. Stop the entire
group when any event in the group triggers the throttle.
The MAX_INTERRUPTS is set to all throttle events.
The unthrottled could happen in 3 places.
- event/group sched. All events in the group are scheduled one by one.
All of them will be unthrottled eventually. Nothing needs to be
changed.
- The perf_adjust_freq_unthr_events for each tick. Needs to restart the
group altogether.
- The __perf_event_period(). The whole group needs to be restarted
altogether as well.
With the fix,
$ sudo perf report -D | grep PERF_RECORD_SAMPLE -a4 | tail -n 5
0 3573470770332 0x12f5f8 [0x70]: PERF_RECORD_SAMPLE(IP, 0x2):
... sample_read:
.... group nr 2
..... id 0000000000000a28, value 00000004fd3dfd8f, lost 0
..... id 0000000000000a29, value 00000004fd3dfd8f, lost 0
Suggested-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20250520181644.2673067-2-kan.liang@linux.intel.com
|
|
There is a spelling mistake in a fail error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250520080657.30726-1-colin.i.king@gmail.com
|