Age | Commit message | Author |
|
In order to prepare for introducing these symbols into the global
namespace, rename:
s/mark_wake_futex/futex_wake_mark/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-13-andrealmeid@collabora.com
|
|
In order to prepare for introducing these symbols into the global
namespace, rename:
s/match_futex/futex_match/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-12-andrealmeid@collabora.com
|
|
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/hb_waiters_/futex_&/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-11-andrealmeid@collabora.com
|
|
Move the PI futex implementation into its own file.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-10-andrealmeid@collabora.com
|
|
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/\<\([^_ ]*\)_futex_value_locked/futex_\1_value_locked/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-9-andrealmeid@collabora.com
|
|
In order to prepare for introducing these symbols into the global
namespace, rename:
s/hash_futex/futex_hash/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-8-andrealmeid@collabora.com
|
|
In order to prepare for introducing these symbols into the global
namespace, rename:
s/__unqueue_futex/__futex_unqueue/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-7-andrealmeid@collabora.com
|
|
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/queue_\(un\)*lock/futex_q_\1lock/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-6-andrealmeid@collabora.com
|
|
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/futex_wait_queue_me/futex_wait_queue/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-5-andrealmeid@collabora.com
|
|
In order to prepare for introducing these symbols into the global
namespace, rename them:
s/\<\(__\)*\(un\)*queue_me/\1futex_\2queue/g
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-4-andrealmeid@collabora.com
|
|
Put the syscalls in their own little file.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-3-andrealmeid@collabora.com
|
|
In preparation for the split-up.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: André Almeida <andrealmeid@collabora.com>
Link: https://lore.kernel.org/r/20210923171111.300673-2-andrealmeid@collabora.com
|
|
Instead of a full barrier around the RMW insn, micro-optimize
for weakly ordered archs such that we only provide the required
ACQUIRE semantics when taking the read lock.
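As a generic illustration of the pattern only (not the actual rwbase code;
the type and field names below are made up for the example), a fully
ordered RMW is replaced by an _acquire variant:
  #include <linux/atomic.h>

  struct demo_rwlock {
          atomic_t readers;
  };

  static inline bool demo_read_trylock(struct demo_rwlock *lock)
  {
          int old = atomic_read(&lock->readers);

          /*
           * Before: a fully ordered RMW (or an RMW plus an explicit full
           * barrier). After: only the ACQUIRE ordering that taking a read
           * lock actually requires on weakly ordered architectures.
           */
          return atomic_try_cmpxchg_acquire(&lock->readers, &old, old + 1);
  }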
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/20210920052031.54220-2-dave@stgolabs.net
|
|
Pull in dependencies.
|
|
Fixup conflicts.
# Conflicts:
# tools/objtool/check.c
|
|
Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
before PTRACE_EVENT_EXIT, and before any cleanup work for a task
happens. This ensures that an accurate copy of the process can be
captured in the coredump as no cleanup for the process happens before
the coredump completes. This also ensures that PTRACE_EVENT_EXIT
will not be visited by any thread until the coredump is complete.
Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
coredump_task_exit can be recognized and ignored in zap_process.
Now that all of the coredumping happens before exit_mm, remove the code
that tests for a coredump in progress from mm_release.
Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
The other tests in may_ptrace_stop all concern avoiding stopping
during a coredump. These tests are no longer necessary as it is now
guaranteed that fatal_signal_pending will be set if the code enters
ptrace_stop during a coredump. The code in ptrace_stop is guaranteed
not to stop if fatal_signal_pending returns true.
Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
ptrace_stop without fatal_signal_pending being true, as signals are
dequeued in get_signal before calling do_exit. This is no longer
an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
until after the coredump completes.
Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
Separate the coredump logic from the ordinary exit_mm logic
by moving the coredump logic out of exit_mm into its own
function, coredump_exit_mm.
Link: https://lkml.kernel.org/r/87a6k2x277.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
exit_code and a siginfo structure. Neither argument is used by any of
the implementations so just remove the unneeded arguments.
The two architectures that implement arch_ptrace_stop are ia64 and
sparc. Both architectures flush their register stacks before a
ptrace_stop so that all of the register information can be accessed
by debuggers.
As the question of whether a register stack needs to be flushed is
independent of why ptrace is stopping, not needing the arguments makes
sense.
Cc: David Miller <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Link: https://lkml.kernel.org/r/87lf3mx290.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
The existence of sigkill_pending is a little silly as it is
functionally a duplicate of fatal_signal_pending that is used in
exactly one place.
Checking for pending fatal signals and returning early in ptrace_stop
is actively harmful. It causes the ptrace_stop called by
ptrace_signal to return early before setting current->exit_code.
Later, when ptrace_signal reads the signal number from
current->exit_code, the value is undefined, making it unpredictable
what will happen.
Instead rely on the fact that schedule will not sleep if there is a
pending signal that can awaken a task.
Removing the explicit sigkill_pending test fixes ptrace_signal
when ptrace_stop does not stop, because current->exit_code is always
set to signr.
Cc: stable@vger.kernel.org
Fixes: 3d749b9e676b ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
Fixes: 1a669c2f16d4 ("Add arch_ptrace_stop")
Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
|
|
When !SCHEDSTATS, schedstat_enabled() is an unconditional 0 and the
whole block doesn't exist; however, GCC figures the scoped variable
'stats' is unused and complains about it.
Upgrade the warning from -Wunused-variable to -Wunused-but-set-variable
by writing it in two statements. This fixes the build because the new
warning is only enabled at W=1.
Given that whole if(0) {} thing, I don't feel motivated to change
things overly much and quite strongly feel this is the compiler being
daft.
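A minimal sketch of the two-statement pattern (assuming the value comes
from a helper such as __schedstats_from_se(); names shown for
illustration):
  /* Before: declaration with initializer. When schedstat_enabled() is a
   * compile-time 0 and the block is elided, GCC emits -Wunused-variable.
   */
  struct sched_statistics *stats = __schedstats_from_se(se);

  /* After: split declaration and assignment, so only the W=1-only
   * -Wunused-but-set-variable can fire.
   */
  struct sched_statistics *stats;

  stats = __schedstats_from_se(se);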
Fixes: cb3e971c435d ("sched: Make struct sched_statistics independent of fair sched class")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
|
Similarly to 09772d92cd5a ("bpf: avoid retpoline for
lookup/update/delete calls on maps") and 84430d4232c3 ("bpf, verifier:
avoid retpoline for map push/pop/peek operation"), avoid an indirect
call when calling bpf_for_each_map_elem.
Before (a program fragment):
; if (rules_map) {
142: (15) if r4 == 0x0 goto pc+8
143: (bf) r3 = r10
; bpf_for_each_map_elem(rules_map, process_each_rule, &ctx, 0);
144: (07) r3 += -24
145: (bf) r1 = r4
146: (18) r2 = subprog[+5]
148: (b7) r4 = 0
149: (85) call bpf_for_each_map_elem#143680 <-- indirect call via helper
After (same program fragment):
; if (rules_map) {
142: (15) if r4 == 0x0 goto pc+8
143: (bf) r3 = r10
; bpf_for_each_map_elem(rules_map, process_each_rule, &ctx, 0);
144: (07) r3 += -24
145: (bf) r1 = r4
146: (18) r2 = subprog[+5]
148: (b7) r4 = 0
149: (85) call bpf_for_each_array_elem#170336 <-- direct call
On a benchmark that calls bpf_for_each_map_elem() once and does many
other things (mostly checking fields in skb) with CONFIG_RETPOLINE=y, it
makes the program faster.
Before:
============================================================================
Benchmark.cpp time/iter iters/s
============================================================================
IngressMatchByRemoteEndpoint 80.78ns 12.38M
IngressMatchByRemoteIP 80.66ns 12.40M
IngressMatchByRemotePort 80.87ns 12.37M
After:
============================================================================
Benchmark.cpp time/iter iters/s
============================================================================
IngressMatchByRemoteEndpoint 73.49ns 13.61M
IngressMatchByRemoteIP 71.48ns 13.99M
IngressMatchByRemotePort 70.39ns 14.21M
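For reference, a minimal BPF C sketch of a program using
bpf_for_each_map_elem() in the way shown in the fragments above (all
names are illustrative; this is not the benchmarked program):
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct bpf_map;

  struct callback_ctx {
          int matched;
  };

  /* Invoked once per element; return 0 to continue, 1 to stop iterating. */
  static long process_each_rule(struct bpf_map *map, const void *key,
                                void *value, void *ctx)
  {
          struct callback_ctx *c = ctx;

          /* ... rule matching against *value would go here ... */
          c->matched++;
          return 0;
  }

  /* rules_map would typically come from a map-of-maps lookup, hence the
   * NULL check mirrored from the fragment above.
   */
  static __always_inline void run_rules(void *rules_map,
                                        struct callback_ctx *ctx)
  {
          if (rules_map)
                  bpf_for_each_map_elem(rules_map, process_each_rule, ctx, 0);
  }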
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211006001838.75607-1-rdna@fb.com
|
|
This adds selftests that test the success and failure paths for module
kfuncs (in the presence of invalid kfunc calls) for both libbpf and
gen_loader. It also adds a prog_test kfunc_btf_id_list so that we can
add a module BTF ID set from bpf_testmod.
This also introduces a couple of test cases to verifier selftests for
validating whether we get an error or not depending on whether an
invalid kfunc call remains after elimination of unreachable instructions.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-10-memxor@gmail.com
|
|
This commit moves BTF ID lookup into the newly added registration
helper, in a way that the bbr, cubic, and dctcp implementations set up
their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not
dependent on modules are looked up from the wrapper function.
This lifts the restriction that they be compiled as built-in objects,
allowing them to be loaded as modules if required. Also modify
Makefile.modfinal to call resolve_btfids for each module.
Note that since kernel kfunc_ids never overlap with module kfunc_ids, we
only match the owner for module btf id sets.
See following commits for background on use of:
CONFIG_X86 ifdef:
569c484f9995 (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86)
CONFIG_DYNAMIC_FTRACE ifdef:
7aae231ac93b (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE)
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com
|
|
This adds helpers for registering btf_id_set from modules and the
bpf_check_mod_kfunc_call callback that can be used to look them up.
With in-kernel sets, the way this is supposed to work is that the
in-kernel callback first looks up within the in-kernel kfunc whitelist,
and then defers to the dynamic BTF set lookup if it doesn't find the
BTF id. If there is no in-kernel BTF id set, this callback can be used
directly.
Also fix includes for btf.h and bpfptr.h so that they can be included
in isolation. This is in preparation for their usage in the tcp_bbr,
tcp_cubic and tcp_dctcp modules in the next patch.
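A rough module-side usage sketch, as I understand the interface added
here (the macro, helper, and list names are recalled from this series
and should be treated as illustrative, as should the kfunc itself):
  #include <linux/btf.h>
  #include <linux/btf_ids.h>
  #include <linux/module.h>

  /* BTF IDs of the kfuncs this module exposes to BPF programs. */
  BTF_SET_START(demo_kfunc_ids)
  BTF_ID(func, demo_module_kfunc)        /* hypothetical kfunc */
  BTF_SET_END(demo_kfunc_ids)

  /* Assumed helper macro from this series. */
  static DEFINE_KFUNC_BTF_ID_SET(&demo_kfunc_ids, demo_kfunc_btf_set);

  static int __init demo_init(void)
  {
          /* Assumed list name, as used by bpf_testmod-style tests. */
          register_kfunc_btf_id_set(&prog_test_kfunc_list, &demo_kfunc_btf_set);
          return 0;
  }

  static void __exit demo_exit(void)
  {
          unregister_kfunc_btf_id_set(&prog_test_kfunc_list, &demo_kfunc_btf_set);
  }

  module_init(demo_init);
  module_exit(demo_exit);
  MODULE_LICENSE("GPL");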
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-4-memxor@gmail.com
|
|
This patch also modifies the BPF verifier to only return error for
invalid kfunc calls specially marked by userspace (with insn->imm == 0,
insn->off == 0) after the verifier has eliminated dead instructions.
This can be handled in the fixup stage, and processing can be skipped
during the add and check stages.
If such an invalid call is dropped, the fixup stage will not encounter
insn->imm as 0, otherwise it bails out and returns an error.
This will be exposed as weak ksym support in libbpf in later patches.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-3-memxor@gmail.com
|
|
This change adds support on the kernel side to allow for BPF programs to
call kernel module functions. Userspace will prepare an array of module
BTF fds that is passed in during BPF_PROG_LOAD using fd_array parameter.
In the kernel, the module BTFs are placed in the auxiliary struct for
bpf_prog, and loaded as needed.
The verifier then uses insn->off to index into the fd_array. insn->off
0 is reserved for vmlinux BTF (for backwards compat), so userspace must
use an fd_array index > 0 for module kfunc support. kfunc_btf_tab is
sorted based on offset in an array, and each offset corresponds to one
descriptor, with a max limit up to 256 such module BTFs.
We also change existing kfunc_tab to distinguish each element based on
imm, off pair as each such call will now be distinct.
Another change is to the check_kfunc_call callback, which now includes
a struct module * pointer. This is to be used in a later patch such
that the kfunc_id and module pointer are matched for dynamically
registered BTF sets from loadable modules, so that the same kfunc_id in
two modules doesn't lead to check_kfunc_call succeeding. For the
duration of the
check_kfunc_call, the reference to struct module exists, as it returns
the pointer stored in kfunc_btf_tab.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-2-memxor@gmail.com
|
|
When the trace_pid_list was created, the default pid max was 32768.
Creating a bitmask that can hold one bit for all 32768 pids took up 4096 bytes (one
page). Having a one page bitmask was not much of a problem, and that was
used for mapping pids. But today, systems are bigger and can run more
tasks, and now the default pid_max is usually set to 4194304. Which means
to handle that many pids requires 524288 bytes. Worse yet, the pid_max can
be set to 2^30 (1073741824 or 1G) which would take 134217728 (128M) of
memory to store this array.
Since the pid_list array is very sparsely populated, it is a huge waste of
memory to store all possible bits for each pid when most will not be set.
Instead, use a page table scheme to store the array, and allow this to
handle up to 30 bit pids.
The pid_mask will start out with 256 entries for the first 8 MSB bits.
This will cost 1K for 32 bit architectures and 2K for 64 bit. Each of
these will have a 256-entry array to store the next 8 bits of the pid
(another 1 or 2K). These will hold a 2K byte bitmask (which will cover the LSB
14 bits or 16384 pids).
When the trace_pid_list is allocated, it will have the 1/2K upper bits
allocated, and then it will allocate a cache for the next upper chunks and
the lower chunks (default 6 of each). Then when a bit is "set", these
chunks will be pulled from the free list and added to the array. If the
free list gets down to a level (default 2), it will trigger an irqwork
that will refill the cache back up.
On clearing a bit, if the clear causes the bitmask to be zero, that chunk
will then be placed back into the free cache for later use, keeping the
need to allocate more down to a minimum.
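For illustration, the index split described above for a 30-bit pid
(constants mirror the description; the in-tree implementation may differ
in details):
  /* 8 upper bits + 8 next bits + 14 lower bits = 30-bit pid */
  #define PID_UPPER_BITS  8
  #define PID_LOWER_BITS  14      /* 16384 pids per bitmask chunk */

  static inline void pid_split(unsigned int pid, unsigned int *upper1,
                               unsigned int *upper2, unsigned int *lower)
  {
          *upper1 = pid >> (PID_UPPER_BITS + PID_LOWER_BITS);     /* top 8 bits */
          *upper2 = (pid >> PID_LOWER_BITS) & ((1 << PID_UPPER_BITS) - 1);
          *lower  = pid & ((1 << PID_LOWER_BITS) - 1);            /* low 14 bits */
  }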
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Instead of having the logic that does trace_pid_list open coded, wrap it in
abstract functions. This will allow a rewrite of the logic that implements
the trace_pid_list without affecting the users.
Note, this causes a change in behavior. Every time a pid is written into
the set_*_pid file, it creates a new list and uses RCU to update it. If
pid_max is lowered, but there was a pid currently in the list that was
higher than pid_max, those pids will now be removed on updating the list.
The old behavior kept that from happening.
The rewrite of the pid_list logic will no longer depend on pid_max,
and will restore the old behavior.
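The shape of the abstraction looks roughly like this (prototypes as I
understand the series; treat the exact signatures as illustrative):
  struct trace_pid_list *trace_pid_list_alloc(void);
  void trace_pid_list_free(struct trace_pid_list *pid_list);

  bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid);
  int  trace_pid_list_set(struct trace_pid_list *pid_list, unsigned int pid);
  int  trace_pid_list_clear(struct trace_pid_list *pid_list, unsigned int pid);

  /* Iteration helpers used when dumping the list. */
  int  trace_pid_list_first(struct trace_pid_list *pid_list, unsigned int *pid);
  int  trace_pid_list_next(struct trace_pid_list *pid_list, unsigned int pid,
                           unsigned int *next);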
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
Found an issue within the cgroup_attach_task_all() fn, which seems
to exclude cgrp_dfl_root (cgroupv2) while attaching tasks to
the given cgroup. This was noticed when the system was running
qemu/kvm with kernel vhost helper threads. It appears that the
vhost layer which uses cgroup_attach_task_all() fn to assign the
vhost kthread to the right qemu cgroup works fine with cgroupv1
based configuration but not in cgroupv2. With cgroupv2, the vhost
helper thread ends up just belonging to the root cgroup as is
shown below:
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
$ sudo pgrep qemu
1916421
$ ps -eL | grep 1916421
1916421 1916421 ? 00:00:01 qemu-system-x86
1916421 1916431 ? 00:00:00 call_rcu
1916421 1916435 ? 00:00:00 IO mon_iothread
1916421 1916436 ? 00:00:34 CPU 0/KVM
1916421 1916439 ? 00:00:00 SPICE Worker
1916421 1916440 ? 00:00:00 vnc_worker
1916433 1916433 ? 00:00:00 vhost-1916421
1916437 1916437 ? 00:00:00 kvm-pit/1916421
$ cat /proc/1916421/cgroup
0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator
$ cat /proc/1916439/cgroup
0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator
$ cat /proc/1916433/cgroup
0::/
From above, it can be seen that the vhost kthread (PID: 1916433)
doesn't seem to belong to the qemu cgroup like the other qemu PIDs.
After applying this patch:
$ pgrep qemu
1643
$ ps -eL | grep 1643
1643 1643 ? 00:00:00 qemu-system-x86
1643 1645 ? 00:00:00 call_rcu
1643 1648 ? 00:00:00 IO mon_iothread
1643 1649 ? 00:00:00 CPU 0/KVM
1643 1652 ? 00:00:00 SPICE Worker
1643 1653 ? 00:00:00 vnc_worker
1647 1647 ? 00:00:00 vhost-1643
1651 1651 ? 00:00:00 kvm-pit/1643
$ cat /proc/1647/cgroup
0::/machine.slice/machine-qemu\x2d18\x2dDroplet\x2d7572850.scope/emulator
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Vishal Verma <vverma@digitalocean.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The Energy Model has a 1:1 mapping between OPPs and performance states
(em_perf_state). If a CPUFreq driver registers an Energy Model,
inefficiencies found by the latter can be applied to CPUFreq.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The new performance domain flag EM_PERF_DOMAIN_SKIP_INEFFICIENCIES
allows inefficient states to be ignored when estimating energy
consumption. This intends to let the Energy Model know that CPUFreq itself
will skip inefficiencies and such states don't need to be part of the
estimation anymore.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Merge the current "milliwatts" option into a "flag" field. This intends to
prepare the extension of this structure for inefficient states support in
the Energy Model.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Some SoCs, such as the sd855, have OPPs within the same performance
domain whose cost is higher than that of others with a higher frequency.
Even though those OPPs are interesting from a cooling perspective, it
makes no sense to use them when the device can run at full capacity.
Those OPPs handicap the performance domain when choosing the most
energy-efficient CPU, and they waste energy. They are inefficient.
Hence, add support for such OPPs to the Energy Model. The table can now
be read skipping inefficient performance states (and by extension,
inefficient OPPs).
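A hedged sketch of what reading the table while skipping inefficient
states can look like (EM_PERF_STATE_INEFFICIENT is the flag name as I
recall it from this series; the helper below is illustrative, not the
in-tree implementation):
  #include <linux/energy_model.h>

  /* Return the lowest frequency >= freq whose state is not marked
   * inefficient; fall back to the highest OPP.
   */
  static unsigned long demo_pick_efficient_freq(struct em_perf_domain *pd,
                                                unsigned long freq)
  {
          int i;

          for (i = 0; i < pd->nr_perf_states; i++) {
                  struct em_perf_state *ps = &pd->table[i];

                  if (ps->frequency >= freq &&
                      !(ps->flags & EM_PERF_STATE_INEFFICIENT))
                          return ps->frequency;
          }

          return pd->table[pd->nr_perf_states - 1].frequency;
  }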
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Currently, a debug message is printed if an inefficient state is detected
in the Energy Model. Unfortunately, it won't detect if the first state is
inefficient or if two successive states are. Fix this behavior.
Fixes: 27871f7a8a34 (PM: Introduce an Energy Model management framework)
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Reviewed-by: Matthias Kaehlcke <mka@chromium.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Since commit 89aafd67f28c ("sched/fair: Use prev instead of new target as recent_used_cpu"),
p->recent_used_cpu is unconditionally set to prev.
Fixes: 89aafd67f28c ("sched/fair: Use prev instead of new target as recent_used_cpu")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210928103544.27489-1-vincent.guittot@linaro.org
|
|
Neither wq_worker_sleeping() nor io_wq_worker_sleeping() needs to be
invoked with preemption disabled:
- The worker flag check operations only need to be serialized against
the worker thread itself.
- The accounting and worker pool operations are serialized with locks.
which means that disabling preemption has neither a reason nor a
value. Remove it and update the stale comment.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lkml.kernel.org/r/8735pnafj7.ffs@tglx
|
|
Doing cleanups in the tail of schedule() is a latency punishment for the
incoming task. The point of invoking kprobe_flush_task() for a dead task
is that the instances are returned and cannot leak when __schedule() is
kprobed.
Move it into the delayed cleanup.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.537994026@linutronix.de
|
|
The queued remote wakeup mechanism has turned out to be suboptimal for RT
enabled kernels. The maximum latencies go up by a factor of > 5x in certain
scenarios.
This is caused by either long wake lists or by a large number of TTWU IPIs
which are processed back to back.
Disable it for RT.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.482262764@linutronix.de
|
|
Batched task migrations are a source for large latencies as they keep the
scheduler from running while processing the migrations.
Limit the batch size to 8 instead of 32 when running on an RT enabled
kernel.
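A minimal sketch of the intent (assuming the knob is the existing
sysctl_sched_nr_migrate default; the actual patch may structure this
differently):
  /* Lower the default migration batch on PREEMPT_RT to bound the time
   * the scheduler is kept from running while migrations are processed.
   */
  #ifdef CONFIG_PREEMPT_RT
  const_debug unsigned int sysctl_sched_nr_migrate = 8;
  #else
  const_debug unsigned int sysctl_sched_nr_migrate = 32;
  #endif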
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.425097596@linutronix.de
|
|
mmdrop() is invoked from finish_task_switch() by the incoming task to drop
the mm which was handed over by the previous task. mmdrop() can be quite
expensive which prevents an incoming real-time task from getting useful
work done.
Provide mmdrop_sched() which maps to mmdrop() on !RT kernels. On RT kernels
it delegates the eventually required invocation of __mmdrop() to RCU.
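A sketch of the resulting helper (assuming an RCU head is added to
mm_struct for the delayed drop; the field and callback names are
illustrative):
  #ifdef CONFIG_PREEMPT_RT
  /* RT: defer the potentially expensive __mmdrop() to an RCU callback so
   * the incoming task is not burdened with it in finish_task_switch().
   */
  static void __mmdrop_delayed(struct rcu_head *rhp)
  {
          struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);

          __mmdrop(mm);
  }

  static inline void mmdrop_sched(struct mm_struct *mm)
  {
          if (unlikely(atomic_dec_and_test(&mm->mm_count)))
                  call_rcu(&mm->delayed_drop, __mmdrop_delayed);
  }
  #else
  /* !RT: identical to mmdrop() */
  static inline void mmdrop_sched(struct mm_struct *mm)
  {
          mmdrop(mm);
  }
  #endif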
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210928122411.648582026@linutronix.de
|
|
Make cookie functions static as these are no longer invoked directly
by other code.
No functional change intended.
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210922085735.52812-1-zhangshaokun@hisilicon.com
|
|
When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
check for the idle state of the destination CPU, dst_cpu, but also of
its SMT siblings.
If dst_cpu is idle but its SMT siblings are busy, performance suffers
if it pulls tasks from a medium priority CPU that does not have SMT
siblings.
Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
siblings of both dst_cpu and the CPUs in the candidate busiest group.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-7-ricardo.neri-calderon@linux.intel.com
|
|
Create a separate function, sched_asym(). A subsequent changeset will
introduce logic to deal with SMT in conjunction with asymmetric
packing. Such logic will need the statistics of the scheduling
group provided as argument. Update them before calling sched_asym().
Co-developed-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-6-ricardo.neri-calderon@linux.intel.com
|
|
Before deciding to pull tasks when using asymmetric packing of tasks,
on some architectures (e.g., x86) it is necessary to know not only the
state of dst_cpu but also of its SMT siblings. The decision to classify
a candidate busiest group as group_asym_packing is done in
update_sg_lb_stats(). Give this function access to the scheduling domain
statistics, which contains the statistics of the local group.
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-5-ricardo.neri-calderon@linux.intel.com
|
|
sched_asym_prefer() always returns false when called on the local group.
By checking local_group, we can avoid additional checks and invoking
sched_asym_prefer() when it is not needed. No functional changes are
introduced.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-4-ricardo.neri-calderon@linux.intel.com
|
|
There exist situations in which the load balancer needs to know the
properties of the CPUs in a scheduling group. When using asymmetric
packing, for instance, the load balancer needs to know not only the
state of dst_cpu but also of its SMT siblings, if any.
Use the flags of the child scheduling domains to initialize scheduling
group flags. This will reflect the properties of the CPUs in the
group.
A subsequent changeset will make use of these new flags. No functional
changes are introduced.
Originally-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Len Brown <len.brown@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210911011819.12184-3-ricardo.neri-calderon@linux.intel.com
|
|
With enabled threaded interrupts the nouveau driver reported the
following:
| Chain exists of:
| &mm->mmap_lock#2 --> &device->mutex --> &cpuset_rwsem
|
| Possible unsafe locking scenario:
|
| CPU0 CPU1
| ---- ----
| lock(&cpuset_rwsem);
| lock(&device->mutex);
| lock(&cpuset_rwsem);
| lock(&mm->mmap_lock#2);
The device->mutex is nvkm_device::mutex.
Unblocking the lockchain at `cpuset_rwsem' is probably the easiest
thing to do. Move the priority reset to the start of the newly
created thread.
Fixes: 710da3c8ea7df ("sched/core: Prevent race condition between cpuset and __sched_setscheduler()")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/a23a826af7c108ea5651e73b8fbae5e653f16e86.camel@gmx.de
|
|
Currently the boot defined preempt behaviour (aka dynamic preempt)
selects full preemption by default when the "preempt=" boot parameter
is omitted. However, distros may rather want to default to either
no preemption or voluntary preemption.
To provide this flexibility, make dynamic preemption a visible
Kconfig option and adapt the preemption behaviour selected by the user
to either static or dynamic preemption.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210914103134.11309-1-frederic@kernel.org
|
|
There is no caller in tree since commit
523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210914095244.52780-1-yuehaibing@huawei.com
|
|
After making struct sched_statistics and its helpers independent of
the fair sched class, we can easily use the schedstats facility for the
deadline sched class.
The schedstat usage in the DL sched class is similar to that in the
fair sched class, for example:
               fair                        deadline
enqueue        update_stats_enqueue_fair   update_stats_enqueue_dl
dequeue        update_stats_dequeue_fair   update_stats_dequeue_dl
put_prev_task  update_stats_wait_start     update_stats_wait_start_dl
set_next_task  update_stats_wait_end       update_stats_wait_end_dl
The user can get the schedstats information in the same way as in the
fair sched class. For example:
               fair                        deadline
               /proc/[pid]/sched           /proc/[pid]/sched
The output of a deadline task's schedstats is as follows:
$ cat /proc/69662/sched
...
se.sum_exec_runtime : 3067.696449
se.nr_migrations : 0
sum_sleep_runtime : 720144.029661
sum_block_runtime : 0.547853
wait_start : 0.000000
sleep_start : 14131540.828955
block_start : 0.000000
sleep_max : 2999.974045
block_max : 0.283637
exec_max : 1.000269
slice_max : 0.000000
wait_max : 0.002217
wait_sum : 0.762179
wait_count : 733
iowait_sum : 0.547853
iowait_count : 3
nr_migrations_cold : 0
nr_failed_migrations_affine : 0
nr_failed_migrations_running : 0
nr_failed_migrations_hot : 0
nr_forced_migrations : 0
nr_wakeups : 246
nr_wakeups_sync : 2
nr_wakeups_migrate : 0
nr_wakeups_local : 244
nr_wakeups_remote : 2
nr_wakeups_affine : 0
nr_wakeups_affine_attempts : 0
nr_wakeups_passive : 0
nr_wakeups_idle : 0
...
The sched:sched_stat_{wait, sleep, iowait, blocked} tracepoints can
be used to trace deadline tasks as well.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210905143547.4668-9-laoar.shao@gmail.com
|