Age | Commit message (Collapse) | Author |
|
The kernel sleep profile is no longer working due to a recursive locking
bug introduced by commit 42a20f86dc19 ("sched: Add wrapper for get_wchan()
to keep task blocked")
Booting with the 'profile=sleep' kernel command line option added or
executing
# echo -n sleep > /sys/kernel/profiling
after boot causes the system to lock up.
Lockdep reports
kthreadd/3 is trying to acquire lock:
ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: get_wchan+0x32/0x70
but task is already holding lock:
ffff93ac82e08d58 (&p->pi_lock){....}-{2:2}, at: try_to_wake_up+0x53/0x370
with the call trace being
lock_acquire+0xc8/0x2f0
get_wchan+0x32/0x70
__update_stats_enqueue_sleeper+0x151/0x430
enqueue_entity+0x4b0/0x520
enqueue_task_fair+0x92/0x6b0
ttwu_do_activate+0x73/0x140
try_to_wake_up+0x213/0x370
swake_up_locked+0x20/0x50
complete+0x2f/0x40
kthread+0xfb/0x180
However, since nobody noticed this regression for more than two years,
let's remove 'profile=sleep' support based on the assumption that nobody
needs this functionality.
Fixes: 42a20f86dc19 ("sched: Add wrapper for get_wchan() to keep task blocked")
Cc: stable@vger.kernel.org # v5.16+
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.12
Pull tip/sched/core to resolve the following four conflicts. While 2-4 are
simple context conflicts, 1 is a bit subtle and easy to resolve incorrectly.
1. 2c8d046d5d51 ("sched: Add normal_policy()")
vs.
faa42d29419d ("sched/fair: Make SCHED_IDLE entity be preempted in strict hierarchy")
The former converts direct test on p->policy to use the helper
normal_policy(). The latter moves the p->policy test to a different
location. Resolve by converting the test on p->plicy in the new location to
use normal_policy().
2. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class")
vs.
a110a81c52a9 ("sched/deadline: Deferrable dl server")
Both add calls to put_prev_task_idle() and set_next_task_idle(). Simple
context conflict. Resolve by taking changes from both.
3. a7a9fc549293 ("sched_ext: Add boilerplate for extensible scheduler class")
vs.
c245910049d0 ("sched/core: Add clearing of ->dl_server in put_prev_task_balance()")
The former changes for_each_class() itertion to use for_each_active_class().
The latter moves away the adjacent dl_server handling code. Simple context
conflict. Resolve by taking changes from both.
4. 60c27fb59f6c ("sched_ext: Implement sched_ext_ops.cpu_online/offline()")
vs.
31b164e2e4af ("sched/smt: Introduce sched_smt_present_inc/dec() helper")
2f027354122f ("sched/core: Introduce sched_set_rq_on/offline() helper")
The former adds scx_rq_deactivate() call. The latter two change code around
it. Simple context conflict. Resolve by taking changes from both.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two fixes for the timer/clocksource code:
- The recent fix to make the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed in testing as some compilers hoist the access
into the non-preemotible section where the pointer is actually
used, but obviously compilers can rightfully invoke it where the
code put it.
Move it into the non-preemptible section right to the actual usage
side to cure it.
- The clocksource watchdog is supposed to emit a warning when the
retry count is greater than one and the number of retries reaches
the limit.
The condition is backwards and warns always when the count is
greater than one. Fixup the condition to prevent spamming dmesg"
* tag 'timers-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource: Fix brown-bag boolean thinko in cs_watchdog_read()
tick/broadcast: Move per CPU pointer access into the atomic section
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
- When stime is larger than rtime due to accounting imprecision, then
utime = rtime - stime becomes negative. As this is unsigned math, the
result becomes a huge positive number.
Cure it by resetting stime to rtime in that case, so utime becomes 0.
- Restore consistent state when sched_cpu_deactivate() fails.
When offlining a CPU fails in sched_cpu_deactivate() after the SMT
present counter has been decremented, then the function aborts but
fails to increment the SMT present counter and leaves it imbalanced.
Consecutive operations cause it to underflow. Add the missing fixup
for the error path.
For SMT accounting the runqueue needs to marked online again in the
error exit path to restore consistent state.
* tag 'sched-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix unbalance set_rq_online/offline() in sched_cpu_deactivate()
sched/core: Introduce sched_set_rq_on/offline() helper
sched/smt: Fix unbalance sched_smt_present dec/inc
sched/smt: Introduce sched_smt_present_inc/dec() helper
sched/cputime: Fix mul_u64_u64_div_u64() precision for cputime
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two fixes for locking and jump labels:
- Ensure that the atomic_cmpxchg() conditions are correct and
evaluating to true on any non-zero value except 1. The missing
check of the return value leads to inconsisted state of the jump
label counter.
- Add a missing type conversion in the paravirt spinlock code which
makes loongson build again"
* tag 'locking-urgent-2024-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jump_label: Fix the fix, brown paper bags galore
locking/pvqspinlock: Correct the type of "old" variable in pv_kick_node()
|
|
From 1232da7eced620537a78f19c8cf3d4a3508e2419 Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Wed, 31 Jul 2024 09:14:52 -1000
p->scx.disallow provides a way for the BPF scheduler to reject certain tasks
from attaching. It's currently allowed for both the load and fork paths;
however, the latter doesn't actually work as p->sched_class is already set
by the time scx_ops_init_task() is called during fork.
This is a convenience feature which is mostly useful from the load path
anyway. Allow it only from the load path.
v2: Trigger scx_ops_error() iff @p->policy == SCHED_EXT to make it a bit
easier for the BPF scheduler (David).
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Zhangqiao (2012 lab)" <zhangqiao22@huawei.com>
Link: http://lkml.kernel.org/r/20240711110720.1285-1-zhangqiao22@huawei.com
Fixes: 7bb6f0810ecf ("sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT")
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently, the coredump_task_exit() function sets the task state
to TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well.
But a combination of large memory and slow (and/or highly contended)
mass storage can cause application core dumps to take more than
two minutes, which can cause check_hung_task(), which is invoked by
check_hung_uninterruptible_tasks(), to produce task-blocked splats.
There does not seem to be any reasonable benefit to getting these splats.
Furthermore, as Oleg Nesterov points out, TASK_UNINTERRUPTIBLE could
be misleading because the task sleeping in coredump_task_exit() really
is killable, albeit indirectly. See the check of signal->core_state
in prepare_signal() and the check of fatal_signal_pending()
in dump_interrupted(), which bypass the normal unkillability of
TASK_UNINTERRUPTIBLE, resulting in coredump_finish() invoking
wake_up_process() on any threads sleeping in coredump_task_exit().
Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE.
Reported-by: Anhad Jai Singh <ffledgling@meta.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: Rik van Riel <riel@surriel.com>
|
|
Right now, cs_watchdog_read() does clocksource sanity checks based
on WATCHDOG_MAX_SKEW, which sets a floor on any clocksource's
.uncertainty_margin. These sanity checks can therefore act
inappropriately for clocksources with large uncertainty margins.
One reason for a clocksource to have a large .uncertainty_margin is when
that clocksource has long read-out latency, given that it does not make
sense for the .uncertainty_margin to be smaller than the read-out latency.
With the current checks, cs_watchdog_read() could reject all normal
reads from a clocksource with long read-out latencies, such as those
from legacy clocksources that are no longer implemented in hardware.
Therefore, recast the cs_watchdog_read() checks in terms of the
.uncertainty_margin values of the clocksources involved in the timespan in
question. The first covers two watchdog reads and one cs read, so use
twice the watchdog .uncertainty_margin plus that of the cs. The second
covers only a pair of watchdog reads, so use twice the watchdog
.uncertainty_margin.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-4-paulmck@kernel.org
|
|
The WATCHDOG_THRESHOLD macro is no longer used to supply a default value
for ->uncertainty_margin, but WATCHDOG_MAX_SKEW now is.
Therefore, update the comments to reflect this change.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-3-paulmck@kernel.org
|
|
Add more detail on the rationale for bounding the clocksource
->uncertainty_margin below at about 500ppm.
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240802154618.4149953-1-paulmck@kernel.org
|
|
The current "nretries > 1 || nretries >= max_retries" check in
cs_watchdog_read() will always evaluate to true, and thus pr_warn(), if
nretries is greater than 1. The intent is instead to never warn on the
first try, but otherwise warn if the successful retry was the last retry.
Therefore, change that "||" to "&&".
Fixes: db3a34e17433 ("clocksource: Retry clock read if long delays detected")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20240802154618.4149953-2-paulmck@kernel.org
|
|
CONFIG_HOTPLUG_PARALLEL expects the architecture to implement
arch_cpuhp_init_parallel_bringup() to decide whether paralllel hotplug is
possible and to do the necessary architecture specific initialization.
There are architectures which can enable it unconditionally and do not
require architecture specific initialization.
Provide a weak fallback for arch_cpuhp_init_parallel_bringup() so that
such architectures are not forced to implement empty stub functions.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240716-loongarch-hotplug-v3-2-af59b3bb35c8@flygoat.com
|
|
Provide stub functions for SMT related parallel bring up functions so that
HOTPLUG_PARALLEL can work without HOTPLUG_SMT.
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240716-loongarch-hotplug-v3-1-af59b3bb35c8@flygoat.com
|
|
As Documentation/filesystems/sysfs.rst suggested, show() should only use
sysfs_emit() or sysfs_emit_at() when formatting the value to be returned
to user space.
No functional change intended.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/20240801083156.2513508-3-luoxueqin@kylinos.cn
[ rjw: Subject edit ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
As Documentation/filesystems/sysfs.rst suggested, show() should only
use sysfs_emit() or sysfs_emit_at() when formatting the value to be
returned to user space.
No functional change intended.
Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn>
Link: https://patch.msgid.link/20240801083156.2513508-2-luoxueqin@kylinos.cn
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Kill the extra get_uprobe() + put_uprobe() in uprobe_unregister() and
move the possibly final put_uprobe() from delete_uprobe() to its only
caller, uprobe_unregister().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240801132749.GA8817@redhat.com
|
|
Fold __uprobe_unregister() into its single caller, uprobe_unregister().
A separate patch to simplify the next change.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240801132744.GA8814@redhat.com
|
|
__uprobe_unregister()
If register_for_each_vma() fails uprobe_register() can safely drop
uprobe->register_rwsem and use uprobe_unregister(). There is no worry
about the races with another register/unregister, consumer_add() was
already called so this case doesn't differ from _unregister() right
after the successful _register().
Yes this means the extra up_write() + down_write(), but this is the
slow and unlikely case anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240801132739.GA8809@redhat.com
|
|
This way uprobe_unregister() and uprobe_apply() can use "struct uprobe *"
rather than inode + offset. This simplifies the code and allows to avoid
the unnecessary find_uprobe() + put_uprobe() in these functions.
TODO: uprobe_unregister() still needs get_uprobe/put_uprobe to ensure that
this uprobe can't be freed before up_write(&uprobe->register_rwsem).
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240801132734.GA8803@redhat.com
|
|
It doesn't make any sense to have 2 versions of _register(). Note that
trace_uprobe_enable(), the only user of uprobe_register(), doesn't need
to check tu->ref_ctr_offset to decide which one should be used, it could
safely pass ref_ctr_offset == 0 to uprobe_register_refctr().
Add this argument to uprobe_register(), update the callers, and kill
uprobe_register_refctr().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240801132728.GA8800@redhat.com
|
|
Return -ENOMEM instead of NULL, which makes caller's error handling just
a touch simpler.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240801132719.GA8788@redhat.com
|
|
get_user_pages_remote() and the comment above it make no sense.
There is no task_struct passed into get_user_pages_remote() anymore, and
nowadays mm_account_fault() increments the current->min/maj_flt counters
regardless of FAULT_FLAG_REMOTE.
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240801132714.GA8783@redhat.com
|
|
The comment above uprobe_write_opcode() is wrong, unapply_uprobe() calls
it under mmap_read_lock() and this is correct.
And it is completely unclear why register_for_each_vma() takes mmap_lock
for writing, add a comment to explain that mmap_write_lock() is needed to
avoid the following race:
- A task T hits the bp installed by uprobe and calls
find_active_uprobe()
- uprobe_unregister() removes this uprobe/bp
- T calls find_uprobe() which returns NULL
- another uprobe_register() installs the bp at the same address
- T calls is_trap_at_addr() which returns true
- T returns to handle_swbp() and gets SIGTRAP.
Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20240801132709.GA8780@redhat.com
|
|
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be `push %rbp` instruction in function
preamble, which means that within that function frame pointer hasn't
been established yet. This leads to consistently missing an actual
caller of the traced function, because perf_callchain_user() only
records current IP (capturing traced function) and then following frame
pointer chain (which would be caller's frame, containing the address of
caller's caller).
So when we have target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, captured stack trace will report
target_1 -> target_3 call chain, which is wrong and confusing.
This patch proposes a x86-64-specific heuristic to detect `push %rbp`
(`push %ebp` on 32-bit architecture) instruction being traced. Given
entire kernel implementation of user space stack trace capturing works
under assumption that user space code was compiled with frame pointer
register (%rbp/%ebp) preservation, it seems pretty reasonable to use
this instruction as a strong indicator that this is the entry to the
function. In that case, return address is still pointed to by %rsp/%esp,
so we fetch it and add to stack trace before proceeding to unwind the
rest using frame pointer-based logic.
We also check for `endbr64` (for 64-bit modes) as another common pattern
for function entry, as suggested by Josh Poimboeuf. Even if we get this
wrong sometimes for uprobes attached not at the function entry, it's OK
because stack trace will still be overall meaningful, just with one
extra bogus entry. If we don't detect this, we end up with guaranteed to
be missing caller function entry in the stack trace, which is worse
overall.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240729175223.23914-1-andrii@kernel.org
|
|
This change allows events to use PERF_SAMPLE_READ with inherit
so long as PERF_SAMPLE_TID is also set. This enables sample based
profiling of a group of counters over a hierarchy of processes or
threads. This is useful, for example, for collecting per-thread
counters/metrics, event based sampling of multiple counters as a unit,
access to the enabled and running time when using multiplexing and so
on.
Prior to this, users were restricted to either collecting aggregate
statistics for a multi-threaded/-process application (e.g. with
"perf stat"), or to sample individual threads, or to profile the entire
system (which requires root or CAP_PERFMON, and may produce much more
data than is required). Theoretically a tool could poll for or otherwise
monitor thread/process creation and construct whatever events the user
is interested in using perf_event_open, for each new thread or process,
but this is racy, can lead to file-descriptor exhaustion, and ultimately
just replicates the behaviour of inherit, but in userspace.
This configuration differs from inherit without PERF_SAMPLE_READ in that
the accumulated event count, and consequently any sample (such as if
triggered by overflow of sample_period) will be on a per-thread rather
than on an aggregate basis.
The meaning of read_format::value field of both PERF_RECORD_READ and
PERF_RECORD_SAMPLE is changed such that if the sampled event uses this
new configuration then the values reported will be per-thread rather
than the global aggregate value. This is a change from the existing
semantics of read_format (where PERF_SAMPLE_READ is used without
inherit), but it is necessary to expose the per-thread counter values,
and it avoids reinventing a separate "read_format_thread" field that
otherwise replicates the same behaviour. This change should not break
existing tools, since this configuration was not previously valid and
was rejected by the kernel. Tools that opt into this new mode will need
to account for this when calculating the counter delta for a given
sample. Tools that wish to have both the per-thread and aggregate value
can perform the global aggregation themselves from the per-thread
values.
The change to read_format::value does not affect existing valid
perf_event_attr configurations, nor does it change the behaviour of
calls to "read" on an event descriptor. Both continue to report the
aggregate value for the entire thread/process hierarchy. The difference
between the results reported by "read" and PERF_RECORD_SAMPLE in this
new configuration is justified on the basis that it is not (easily)
possible for "read" to target a specific thread (the caller only has
the fd for the original parent event).
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240730084417.7693-3-ben.gainey@arm.com
|
|
nr_pending counts the number of events in the context that
either pending_sigtrap or pending_work, but it is used
to prevent taking the fast path in perf_event_context_sched_out.
Renamed to reflect what it is used for, rather than what it
counts. This change allows using the field to track other
event properties that also require skipping the fast path
without possible confusion over the name.
Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240730084417.7693-2-ben.gainey@arm.com
|
|
Fixes the following Coccinelle/coccicheck warning reported by
minmax.cocci:
WARNING opportunity for min()
Use const size_t instead of int for the result of min().
Compile-tested with CONFIG_KCSAN=y.
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
|
|
scx_dump_task() uses stack_trace_save_tsk() which is only available when
CONFIG_STACKTRACE. Make CONFIG_SCHED_CLASS_EXT select CONFIG_STACKTRACE if
the support is available and skip capturing stack trace if
!CONFIG_STACKTRACE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407161844.reewQQrR-lkp@intel.com/
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
We currently only allow calling sleepable scx kfuncs (i.e.
scx_bpf_create_dsq()) from BPF_PROG_TYPE_STRUCT_OPS progs. The idea here
was that we'd never have to call scx_bpf_create_dsq() outside of a
sched_ext struct_ops callback, but that might not actually be true. For
example, a scheduler could do something like the following:
1. Open and load (not yet attach) a scheduler skel
2. Synchronously call into a BPF_PROG_TYPE_SYSCALL prog from user space.
For example, to initialize an LLC domain, or some other global,
read-only state.
3. Attach the skel, which actually enables the scheduler
The advantage of doing this is that it can preclude having to do pretty
ugly boilerplate like initializing a read-only, statically sized array of
u64[]'s which the kernel consumes literally once at init time to then
create struct bpf_cpumask objects which are actually queried at runtime.
Doing the above is already possible given that we can invoke core BPF
kfuncs, such as bpf_cpumask_create(), from BPF_PROG_TYPE_SYSCALL progs. We
already allow many scx kfuncs to be called from BPF_PROG_TYPE_SYSCALL progs
(e.g. scx_bpf_kick_cpu()). Let's allow the sleepable kfuncs as well.
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Cgroup subsystem state (CSS) is an abstraction in the cgroup layer to
help manage different structures in various cgroup subsystems by being
an embedded element inside a larger structure like cpuset or mem_cgroup.
The /proc/cgroups file shows the number of cgroups for each of the
subsystems. With cgroup v1, the number of CSSes is the same as the
number of cgroups. That is not the case anymore with cgroup v2. The
/proc/cgroups file cannot show the actual number of CSSes for the
subsystems that are bound to cgroup v2.
So if a v2 cgroup subsystem is leaking cgroups (usually memory cgroup),
we can't tell by looking at /proc/cgroups which cgroup subsystems may
be responsible.
As cgroup v2 had deprecated the use of /proc/cgroups, the hierarchical
cgroup.stat file is now being extended to show the number of live and
dying CSSes associated with all the non-inhibited cgroup subsystems that
have been bound to cgroup v2. The number includes CSSes in the current
cgroup as well as in all the descendants underneath it. This will help
us pinpoint which subsystems are responsible for the increasing number
of dying (nr_dying_descendants) cgroups.
The CSSes dying counts are stored in the cgroup structure itself
instead of inside the CSS as suggested by Johannes. This will allow
us to accurately track dying counts of cgroup subsystems that have
recently been disabled in a cgroup. It is now possible that a zero
subsystem number is coupled with a non-zero dying subsystem number.
The cgroup-v2.rst file is updated to discuss this new behavior.
With this patch applied, a sample output from root cgroup.stat file
was shown below.
nr_descendants 56
nr_subsys_cpuset 1
nr_subsys_cpu 43
nr_subsys_io 43
nr_subsys_memory 56
nr_subsys_perf_event 57
nr_subsys_hugetlb 1
nr_subsys_pids 56
nr_subsys_rdma 1
nr_subsys_misc 1
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Another sample output from system.slice/cgroup.stat was:
nr_descendants 34
nr_subsys_cpuset 0
nr_subsys_cpu 32
nr_subsys_io 32
nr_subsys_memory 34
nr_subsys_perf_event 35
nr_subsys_hugetlb 0
nr_subsys_pids 34
nr_subsys_rdma 0
nr_subsys_misc 0
nr_dying_descendants 30
nr_dying_subsys_cpuset 0
nr_dying_subsys_cpu 0
nr_dying_subsys_io 0
nr_dying_subsys_memory 30
nr_dying_subsys_perf_event 0
nr_dying_subsys_hugetlb 0
nr_dying_subsys_pids 0
nr_dying_subsys_rdma 0
nr_dying_subsys_misc 0
Note that 'debug' controller wasn't used to provide this information because
the controller is not recommended in productions kernels, also many of them
won't enable CONFIG_CGROUP_DEBUG by default.
Similar information could be retrieved with debuggers like drgn but that's
also not always available (e.g. lockdown) and the additional cost of runtime
tracking here is deemed marginal.
tj: Added Michal's paragraphs on why this is not added the debug controller
to the commit message.
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Cc: Michal Koutný <mkoutny@suse.com>
Link: http://lkml.kernel.org/r/20240715150034.2583772-1-longman@redhat.com
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Per the example of:
!atomic_cmpxchg(&key->enabled, 0, 1)
the inverse was written as:
atomic_cmpxchg(&key->enabled, 1, 0)
except of course, that while !old is only true for old == 0, old is
true for everything except old == 0.
Fix it to read:
atomic_cmpxchg(&key->enabled, 1, 0) == 1
such that only the 1->0 transition returns true and goes on to disable
the keys.
Fixes: 83ab38ef0a0b ("jump_label: Fix concurrency issues in static_key_slow_dec()")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lkml.kernel.org/r/20240731105557.GY33588@noisy.programming.kicks-ass.net
|
|
The recent fix for making the take over of the broadcast timer more
reliable retrieves a per CPU pointer in preemptible context.
This went unnoticed as compilers hoist the access into the non-preemptible
region where the pointer is actually used. But of course it's valid that
the compiler keeps it at the place where the code puts it which rightfully
triggers:
BUG: using smp_processor_id() in preemptible [00000000] code:
caller is hotplug_cpu__broadcast_tick_pull+0x1c/0xc0
Move it to the actual usage site which is in a non-preemptible region.
Fixes: f7d43dd206e7 ("tick/broadcast: Make takeover of broadcast hrtimer reliable")
Reported-by: David Wang <00107082@163.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/87ttg56ers.ffs@tglx
|
|
The process of constructing scheduling domains
involves multiple loops and repeated evaluations, leading to numerous
redundant and ineffective assessments that impact code efficiency.
Here, we use union-find to optimize the merging of cpumasks. By employing
path compression and union by rank, we effectively reduce the number of
lookups and merge comparisons.
Signed-off-by: Xavier <xavier_qy@163.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There are several functions to decrease attach_in_progress, and they
will wake up cpuset_attach_wq when attach_in_progress is zero. So,
add a helper to make it concise.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Since the SLAB implementation was removed in v6.8, so the
cpuset_slab_spread_rotor is no longer used and can be removed.
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently when a process in a group forks and fails due to it's
parent's max restriction, all the cgroups from 'pids_forking' to root
will generate event notifications but only the cgroups from
'pids_over_limit' to root will increase the counter of PIDCG_MAX.
Consider this scenario: there are 4 groups A, B, C,and D, the
relationships are as follows, and user is watching on C.pids.events.
root->A->B->C->D
When a process in D forks and fails due to B.max restriction, the
user will get a spurious event notification because when he wakes up
and reads C.pids.events, he will find that the content has not changed.
To address this issue, only the cgroups from 'pids_over_limit' to root
will have their PIDCG_MAX counters increased and event notifications
generated.
Fixes: 385a635cacfe ("cgroup/pids: Make event counters hierarchical")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The child_ecpus_count variable was previously used to update
sibling cpumask when parent's effective_cpus is updated. However, it became
obsolete after commit e2ffe502ba45 ("cgroup/cpuset: Add
cpuset.cpus.exclusive for v2"). It should be removed.
tj: Restored {} for style consistency.
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Linux 6.11-rc1
|
|
For some reason I didn't see this issue on my arm64 or x86-64 builds,
but Stephen Rothwell reports that commit 2accfdb7eff6 ("profiling:
attempt to remove per-cpu profile flip buffer") left these static
variables around, and the powerpc build is unhappy about them:
kernel/profile.c:52:28: warning: 'cpu_profile_flip' defined but not used [-Wunused-variable]
52 | static DEFINE_PER_CPU(int, cpu_profile_flip);
| ^~~~~~~~~~~~~~~~
..
So remove these stale left-over remnants too.
Fixes: 2accfdb7eff6 ("profiling: attempt to remove per-cpu profile flip buffer")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The function bpf_get_smp_processor_id() is processed in a different
way, depending on the arch:
- on x86 verifier replaces call to bpf_get_smp_processor_id() with a
sequence of instructions that modify only r0;
- on riscv64 jit replaces call to bpf_get_smp_processor_id() with a
sequence of instructions that modify only r0;
- on arm64 jit replaces call to bpf_get_smp_processor_id() with a
sequence of instructions that modify only r0 and tmp registers.
These rewrites satisfy attribute no_caller_saved_registers contract.
Allow rewrite of no_caller_saved_registers patterns for
bpf_get_smp_processor_id() in order to use this function as a canary
for no_caller_saved_registers tests.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240722233844.1406874-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
GCC and LLVM define a no_caller_saved_registers function attribute.
This attribute means that function scratches only some of
the caller saved registers defined by ABI.
For BPF the set of such registers could be defined as follows:
- R0 is scratched only if function is non-void;
- R1-R5 are scratched only if corresponding parameter type is defined
in the function prototype.
This commit introduces flag bpf_func_prot->allow_nocsr.
If this flag is set for some helper function, verifier assumes that
it follows no_caller_saved_registers calling convention.
The contract between kernel and clang allows to simultaneously use
such functions and maintain backwards compatibility with old
kernels that don't understand no_caller_saved_registers calls
(nocsr for short):
- clang generates a simple pattern for nocsr calls, e.g.:
r1 = 1;
r2 = 2;
*(u64 *)(r10 - 8) = r1;
*(u64 *)(r10 - 16) = r2;
call %[to_be_inlined]
r2 = *(u64 *)(r10 - 16);
r1 = *(u64 *)(r10 - 8);
r0 = r1;
r0 += r2;
exit;
- kernel removes unnecessary spills and fills, if called function is
inlined by verifier or current JIT (with assumption that patch
inserted by verifier or JIT honors nocsr contract, e.g. does not
scratch r3-r5 for the example above), e.g. the code above would be
transformed to:
r1 = 1;
r2 = 2;
call %[to_be_inlined]
r0 = r1;
r0 += r2;
exit;
Technically, the transformation is split into the following phases:
- function mark_nocsr_patterns(), called from bpf_check()
searches and marks potential patterns in instruction auxiliary data;
- upon stack read or write access,
function check_nocsr_stack_contract() is used to verify if
stack offsets, presumably reserved for nocsr patterns, are used
only from those patterns;
- function remove_nocsr_spills_fills(), called from bpf_check(),
applies the rewrite for valid patterns.
See comment in mark_nocsr_pattern_for_call() for more details.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240722233844.1406874-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
Extract the part of check_helper_call() as a utility function allowing
to query 'struct bpf_func_proto' for a specific helper function id.
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240722233844.1406874-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
With latest llvm19, the selftest iters/iter_arr_with_actual_elem_count
failed with -mcpu=v4.
The following are the details:
0: R1=ctx() R10=fp0
; int iter_arr_with_actual_elem_count(const void *ctx) @ iters.c:1420
0: (b4) w7 = 0 ; R7_w=0
; int i, n = loop_data.n, sum = 0; @ iters.c:1422
1: (18) r1 = 0xffffc90000191478 ; R1_w=map_value(map=iters.bss,ks=4,vs=1280,off=1144)
3: (61) r6 = *(u32 *)(r1 +128) ; R1_w=map_value(map=iters.bss,ks=4,vs=1280,off=1144) R6_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
; if (n > ARRAY_SIZE(loop_data.data)) @ iters.c:1424
4: (26) if w6 > 0x20 goto pc+27 ; R6_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f))
5: (bf) r8 = r10 ; R8_w=fp0 R10=fp0
6: (07) r8 += -8 ; R8_w=fp-8
; bpf_for(i, 0, n) { @ iters.c:1427
7: (bf) r1 = r8 ; R1_w=fp-8 R8_w=fp-8
8: (b4) w2 = 0 ; R2_w=0
9: (bc) w3 = w6 ; R3_w=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R6_w=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f))
10: (85) call bpf_iter_num_new#45179 ; R0=scalar() fp-8=iter_num(ref_id=2,state=active,depth=0) refs=2
11: (bf) r1 = r8 ; R1=fp-8 R8=fp-8 refs=2
12: (85) call bpf_iter_num_next#45181 13: R0=rdonly_mem(id=3,ref_obj_id=2,sz=4) R6=scalar(id=1,smin=smin32=0,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R7=0 R8=fp-8 R10=fp0 fp-8=iter_num(ref_id=2,state=active,depth=1) refs=2
; bpf_for(i, 0, n) { @ iters.c:1427
13: (15) if r0 == 0x0 goto pc+2 ; R0=rdonly_mem(id=3,ref_obj_id=2,sz=4) refs=2
14: (81) r1 = *(s32 *)(r0 +0) ; R0=rdonly_mem(id=3,ref_obj_id=2,sz=4) R1_w=scalar(smin=0xffffffff80000000,smax=0x7fffffff) refs=2
15: (ae) if w1 < w6 goto pc+4 20: R0=rdonly_mem(id=3,ref_obj_id=2,sz=4) R1=scalar(smin=0xffffffff80000000,smax=smax32=umax32=31,umax=0xffffffff0000001f,smin32=0,var_off=(0x0; 0xffffffff0000001f)) R6=scalar(id=1,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R7=0 R8=fp-8 R10=fp0 fp-8=iter_num(ref_id=2,state=active,depth=1) refs=2
; sum += loop_data.data[i]; @ iters.c:1429
20: (67) r1 <<= 2 ; R1_w=scalar(smax=0x7ffffffc0000007c,umax=0xfffffffc0000007c,smin32=0,smax32=umax32=124,var_off=(0x0; 0xfffffffc0000007c)) refs=2
21: (18) r2 = 0xffffc90000191478 ; R2_w=map_value(map=iters.bss,ks=4,vs=1280,off=1144) refs=2
23: (0f) r2 += r1
math between map_value pointer and register with unbounded min value is not allowed
The source code:
int iter_arr_with_actual_elem_count(const void *ctx)
{
int i, n = loop_data.n, sum = 0;
if (n > ARRAY_SIZE(loop_data.data))
return 0;
bpf_for(i, 0, n) {
/* no rechecking of i against ARRAY_SIZE(loop_data.n) */
sum += loop_data.data[i];
}
return sum;
}
The insn #14 is a sign-extenstion load which is related to 'int i'.
The insn #15 did a subreg comparision. Note that smin=0xffffffff80000000 and this caused later
insn #23 failed verification due to unbounded min value.
Actually insn #15 R1 smin range can be better. Before insn #15, we have
R1_w=scalar(smin=0xffffffff80000000,smax=0x7fffffff)
With the above range, we know for R1, upper 32bit can only be 0xffffffff or 0.
Otherwise, the value range for R1 could be beyond [smin=0xffffffff80000000,smax=0x7fffffff].
After insn #15, for the true patch, we know smin32=0 and smax32=32. With the upper 32bit 0xffffffff,
then the corresponding value is [0xffffffff00000000, 0xffffffff00000020]. The range is
obviously beyond the original range [smin=0xffffffff80000000,smax=0x7fffffff] and the
range is not possible. So the upper 32bit must be 0, which implies smin = smin32 and
smax = smax32.
This patch fixed the issue by adding additional register deduction after 32-bit compare
insn. If the signed 32-bit register range is non-negative then 64-bit smin is
in range of [S32_MIN, S32_MAX], then the actual 64-bit smin/smax should be the same
as 32-bit smin32/smax32.
With this patch, iters/iter_arr_with_actual_elem_count succeeded with better register range:
from 15 to 20: R0=rdonly_mem(id=7,ref_obj_id=2,sz=4) R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R6=scalar(id=1,smin=umin=smin32=umin32=1,smax=umax=smax32=umax32=32,var_off=(0x0; 0x3f)) R7=scalar(id=9,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R8=scalar(id=9,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R10=fp0 fp-8=iter_num(ref_id=2,state=active,depth=3) refs=2
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240723162933.2731620-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
syzbot reported a kernel crash due to
commit 1f1e864b6555 ("bpf: Handle sign-extenstin ctx member accesses").
The reason is due to sign-extension of 32-bit load for
packet data/data_end/data_meta uapi field.
The original code looks like:
r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */
r3 = *(u32 *)(r1 + 80) /* load __sk_buff->data_end */
r0 = r2
r0 += 8
if r3 > r0 goto +1
...
Note that __sk_buff->data load has 32-bit sign extension.
After verification and convert_ctx_accesses(), the final asm code looks like:
r2 = *(u64 *)(r1 +208)
r2 = (s32)r2
r3 = *(u64 *)(r1 +80)
r0 = r2
r0 += 8
if r3 > r0 goto pc+1
...
Note that 'r2 = (s32)r2' may make the kernel __sk_buff->data address invalid
which may cause runtime failure.
Currently, in C code, typically we have
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
...
and it will generate
r2 = *(u64 *)(r1 +208)
r3 = *(u64 *)(r1 +80)
r0 = r2
r0 += 8
if r3 > r0 goto pc+1
If we allow sign-extension,
void *data = (void *)(long)(int)skb->data;
void *data_end = (void *)(long)skb->data_end;
...
the generated code looks like
r2 = *(u64 *)(r1 +208)
r2 <<= 32
r2 s>>= 32
r3 = *(u64 *)(r1 +80)
r0 = r2
r0 += 8
if r3 > r0 goto pc+1
and this will cause verification failure since "r2 <<= 32" is not allowed
as "r2" is a packet pointer.
To fix this issue for case
r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */
this patch added additional checking in is_valid_access() callback
function for packet data/data_end/data_meta access. If those accesses
are with sign-extenstion, the verification will fail.
[1] https://lore.kernel.org/bpf/000000000000c90eee061d236d37@google.com/
Reported-by: syzbot+ad9ec60c8eaf69e6f99c@syzkaller.appspotmail.com
Fixes: 1f1e864b6555 ("bpf: Handle sign-extenstin ctx member accesses")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240723153439.2429035-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
After checking lsm hook return range in verifier, the test case
"test_progs -t test_lsm" failed, and the failure log says:
libbpf: prog 'test_int_hook': BPF program load failed: Invalid argument
libbpf: prog 'test_int_hook': -- BEGIN PROG LOAD LOG --
0: R1=ctx() R10=fp0
; int BPF_PROG(test_int_hook, struct vm_area_struct *vma, @ lsm.c:89
0: (79) r0 = *(u64 *)(r1 +24) ; R0_w=scalar(smin=smin32=-4095,smax=smax32=0) R1=ctx()
[...]
24: (b4) w0 = -1 ; R0_w=0xffffffff
; int BPF_PROG(test_int_hook, struct vm_area_struct *vma, @ lsm.c:89
25: (95) exit
At program exit the register R0 has smin=4294967295 smax=4294967295 should have been in [-4095, 0]
It can be seen that instruction "w0 = -1" zero extended -1 to 64-bit
register r0, setting both smin and smax values of r0 to 4294967295.
This resulted in a false reject when r0 was checked with range [-4095, 0].
Given bpf lsm does not return 64-bit values, this patch fixes it by changing
the compare between r0 and return range from 64-bit operation to 32-bit
operation for bpf lsm.
Fixes: 8fa4ecd49b81 ("bpf: enforce exact retval range on subprog/callback exit")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20240719110059.797546-5-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
bpf progs can be attached to kernel functions, and the attached functions
can take different parameters or return different return values. If
prog attached to one kernel function tail calls prog attached to another
kernel function, the ctx access or return value verification could be
bypassed.
For example, if prog1 is attached to func1 which takes only 1 parameter
and prog2 is attached to func2 which takes two parameters. Since verifier
assumes the bpf ctx passed to prog2 is constructed based on func2's
prototype, verifier allows prog2 to access the second parameter from
the bpf ctx passed to it. The problem is that verifier does not prevent
prog1 from passing its bpf ctx to prog2 via tail call. In this case,
the bpf ctx passed to prog2 is constructed from func1 instead of func2,
that is, the assumption for ctx access verification is bypassed.
Another example, if BPF LSM prog1 is attached to hook file_alloc_security,
and BPF LSM prog2 is attached to hook bpf_lsm_audit_rule_known. Verifier
knows the return value rules for these two hooks, e.g. it is legal for
bpf_lsm_audit_rule_known to return positive number 1, and it is illegal
for file_alloc_security to return positive number. So verifier allows
prog2 to return positive number 1, but does not allow prog1 to return
positive number. The problem is that verifier does not prevent prog1
from calling prog2 via tail call. In this case, prog2's return value 1
will be used as the return value for prog1's hook file_alloc_security.
That is, the return value rule is bypassed.
This patch adds restriction for tail call to prevent such bypasses.
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240719110059.797546-4-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
A bpf prog returning a positive number attached to file_alloc_security
hook makes kernel panic.
This happens because file system can not filter out the positive number
returned by the LSM prog using IS_ERR, and misinterprets this positive
number as a file pointer.
Given that hook file_alloc_security never returned positive number
before the introduction of BPF LSM, and other BPF LSM hooks may
encounter similar issues, this patch adds LSM return value check
in verifier, to ensure no unexpected value is returned.
Fixes: 520b7aa00d8c ("bpf: lsm: Initialize the BPF LSM hooks")
Reported-by: Xin Liu <liuxin350@huawei.com>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240719110059.797546-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
Add a disabled hooks list for BPF LSM. progs being attached to the
listed hooks will be rejected by the verifier.
Suggested-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240719110059.797546-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
|
|
These really can be handled gracefully without killing the machine.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
The task pointer which is handed to dequeue_signal() is always current. The
argument along with the first comment about signalfd in that function is
confusing at best. Remove it and use current internally.
Update the stale comment for dequeue_signal() while at it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|