|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner:
"This adds a new pidfd_prepare() helper which allows the caller to
reserve a pidfd number and allocates a new pidfd file that stashes the
provided struct pid.
Installing a file descriptor into a task's file descriptor table just to
close it again via close_fd() when an error occurs should be avoided: the
fd has already been visible to userspace and might already be in use.
Instead, a file descriptor should be reserved but not installed into the
caller's file descriptor table.
If another failure path is hit then the reserved file descriptor and
file can just be put without any userspace visible side-effects. And
if all failure paths are cleared the file descriptor and file can be
installed into the task's file descriptor table.
This helper is now used in all places that open-coded this
functionality before. For example, this was previously done during
copy_process(), and fanotify used pidfd_create(), which returns a pidfd
that has already been made visible in the caller's file descriptor
table, but then closed it using close_fd().
In one of the next merge windows there is also new functionality
coming to unix domain sockets that will have to rely on
pidfd_prepare()"
* tag 'v6.4/pidfd.file' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fanotify: use pidfd_prepare()
fork: use pidfd_prepare()
pid: add pidfd_prepare()
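A minimal sketch of the reserve-then-install pattern described above. The
pidfd_prepare() signature is taken from this series (kernel/pid.c); the
surrounding function and the do_more_setup() step are purely illustrative.
/* Illustrative only: no fd becomes visible to userspace before all
 * failure paths are cleared. */
#include <linux/file.h>
#include <linux/pid.h>

static int do_more_setup(void)          /* stand-in for a step that may fail */
{
        return 0;
}

static int example_open_pidfd(struct pid *pid)
{
        struct file *pidfd_file;
        int pidfd, err;

        /* Reserve an fd number and allocate the pidfd file. */
        pidfd = pidfd_prepare(pid, 0, &pidfd_file);
        if (pidfd < 0)
                return pidfd;

        err = do_more_setup();
        if (err) {
                /* Nothing was ever visible to userspace; just drop both. */
                put_unused_fd(pidfd);
                fput(pidfd_file);
                return err;
        }

        fd_install(pidfd, pidfd_file);  /* now the fd becomes visible */
        return pidfd;
}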
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull user work thread updates from Christian Brauner:
"This contains the work generalizing the ability to create a kernel
worker from a userspace process.
Such user workers will run with the same credentials as the userspace
process they were created from, providing stronger security and
accounting guarantees than the traditional override_creds() approach
ever could've hoped for.
The original work was heavily based on and optimized for the needs of
io_uring, which was the first user. However, as it quickly turned out,
the ability to create user workers inheriting properties from a
userspace process is generally useful.
The vhost subsystem currently creates workers using the kthread api.
The consequence of using the kthread api is that RLIMITs don't work
correctly, as they are inherited from kthreadd. This leads to bugs
where more workers are created than would be allowed by the RLIMITs of
the userspace process on whose behalf the workers are created.
Problems like this disappear with user workers created from the
userspace processes for which they perform the work. In addition,
providing this api allows vhost to remove additional complexity. For
example, cgroup and mm sharing will just work out of the box with user
workers based on the relevant userspace process instead of manually
ensuring the correct cgroup and mm contexts are used.
So the vhost subsystem should simply be made to use the same mechanism
as io_uring. To this end the original mechanism used for
create_io_thread() is generalized into user workers:
- Introduce PF_USER_WORKER as a generic indicator that a given task
is a user worker, i.e., a kernel task that was created from a
userspace process. Now a PF_IO_WORKER thread is just a specialized
version of PF_USER_WORKER. So io_uring io workers raise both flags.
- Make copy_process() available to core kernel code
- Extend struct kernel_clone_args with the following bitfields that
allow callers to indicate to copy_process() (see the sketch after the
shortlog below):
- to create a user worker (raise PF_USER_WORKER)
- to not inherit any files from the userspace process
- to ignore signals
After all generic changes are in place the vhost subsystem implements
a new dedicated vhost api based on user workers. Finally, vhost is
switched to rely on the new api moving it off of kthreads.
Thanks to Mike for sticking it out and making it through this rather
arduous journey"
* tag 'v6.4/kernel.user_worker' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
vhost: use vhost_tasks for worker threads
vhost: move worker thread fields to new struct
vhost_task: Allow vhost layer to use copy_process
fork: allow kernel code to call copy_process
fork: Add kernel_clone_args flag to ignore signals
fork: add kernel_clone_args flag to not dup/clone files
fork/vm: Move common PF_IO_WORKER behavior to new flag
kernel: Make io_thread and kthread bit fields
kthread: Pass in the thread's name during creation
kernel: Allow a kernel thread's name to be set in copy_process
csky: Remove kernel_thread declaration
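A rough sketch of how a user-worker creation path might fill the new
kernel_clone_args bitfields, loosely modelled on the vhost_task layer this
series introduces; the exact field names, clone flags and copy_process()
calling convention below are assumptions to check against
kernel/vhost_task.c and kernel/fork.c.
/* Illustrative only: creates a kernel worker that inherits mm, cgroup and
 * credentials from the calling userspace process, without inheriting its
 * files and while ignoring signals. */
#include <linux/sched/task.h>
#include <linux/numa.h>

static struct task_struct *example_user_worker(int (*worker_fn)(void *),
                                               void *arg, const char *name)
{
        struct kernel_clone_args args = {
                .flags          = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
                .exit_signal    = 0,
                .fn             = worker_fn,
                .fn_arg         = arg,
                .name           = name,        /* thread name set at creation */
                .user_worker    = 1,           /* raises PF_USER_WORKER */
                .no_files       = 1,           /* don't dup/clone files */
                .ignore_signals = 1,
        };

        /* copy_process() is now callable from core kernel code. */
        return copy_process(NULL, 0, NUMA_NO_NODE, &args);
}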
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux
Pull RCU updates from Joel Fernandes:
- Updates and additions to MAINTAINERS files, with Boqun being added to
the RCU entry and Zqiang being added as an RCU reviewer.
I have also transitioned from reviewer to maintainer; however, Paul
will be taking over sending RCU pull-requests for the next merge
window.
- Resolution of hotplug warning in nohz code, achieved by fixing
cpu_is_hotpluggable() through interaction with the nohz subsystem.
Tick dependency modifications by Zqiang, focusing on fixing usage of
the TICK_DEP_BIT_RCU_EXP bitmask.
- Avoid needless calls to the rcu-lazy shrinker for CONFIG_RCU_LAZY=n
kernels, fixed by Zqiang.
- Improvements to rcu-tasks stall reporting by Neeraj.
- Initial renaming of k[v]free_rcu() to k[v]free_rcu_mightsleep() for
increased robustness, affecting several components like mac802154,
drbd, vmw_vmci, tracing, and more.
A report by Eric Dumazet showed that the API could be unknowingly
used in an atomic context, so we'd rather make sure callers know what
they're asking for by being explicit (a short usage sketch follows this
entry's shortlog):
https://lore.kernel.org/all/20221202052847.2623997-1-edumazet@google.com/
- Documentation updates, including corrections to spelling,
clarifications in comments, and improvements to the srcu_size_state
comments.
- Better srcu_struct cache locality for readers, by adjusting the size
of srcu_struct in support of SRCU usage by Christoph Hellwig.
- Teach lockdep to detect deadlocks between srcu_read_lock() vs
synchronize_srcu() contributed by Boqun.
Previously lockdep could not detect such deadlocks, now it can.
- Integration of rcutorture and rcu-related tools, targeted for v6.4
from Boqun's tree, featuring new SRCU deadlock scenarios, test_nmis
module parameter, and more
- Miscellaneous changes, various code cleanups and comment improvements
* tag 'rcu.6.4.april5.2023.3' of git://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux: (71 commits)
checkpatch: Error out if deprecated RCU API used
mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep()
rcuscale: Rename kfree_rcu() to kfree_rcu_mightsleep()
ext4/super: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()
net/sysctl: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
lib/test_vmalloc.c: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
tracing: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
misc: vmw_vmci: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
drbd: Rename kvfree_rcu() to kvfree_rcu_mightsleep()
rcu: Protect rcu_print_task_exp_stall() ->exp_tasks access
rcu: Avoid stack overflow due to __rcu_irq_enter_check_tick() being kprobe-ed
rcu-tasks: Report stalls during synchronize_srcu() in rcu_tasks_postscan()
rcu: Permit start_poll_synchronize_rcu_expedited() to be invoked early
rcu: Remove never-set needwake assignment from rcu_report_qs_rdp()
rcu: Register rcu-lazy shrinker only for CONFIG_RCU_LAZY=y kernels
rcu: Fix missing TICK_DEP_MASK_RCU_EXP dependency check
rcu: Fix set/clear TICK_DEP_BIT_RCU_EXP bitmask race
rcu/trace: use strscpy() to instead of strncpy()
tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem
...
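A short sketch of the rename described in the k[v]free_rcu() item above: the
two-argument form keeps its name and never sleeps, while the single-argument
form is now spelled kfree_rcu_mightsleep() to make the possible blocking
explicit. The struct and function here are illustrative.
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        int data;
        struct rcu_head rcu;    /* needed only for the two-argument form */
};

static void example_free(struct foo *p, struct foo *q)
{
        /* Two-argument form: never sleeps, usable in atomic context. */
        kfree_rcu(p, rcu);

        /* Single-argument form: may block (it can fall back to
         * synchronize_rcu()), hence the explicit new name. Only call
         * this from sleepable context. */
        kfree_rcu_mightsleep(q);
}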
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull locktorture updates from Paul McKenney:
"This adds tests for nested locking and also adds support for testing
raw spinlocks in PREEMPT_RT kernels"
* tag 'locktorture.2023.04.04a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
locktorture: Add raw_spinlock* torture tests for PREEMPT_RT kernels
locktorture: With nested locks, occasionally skip main lock
locktorture: Add nested locking to rtmutex torture tests
locktorture: Add nested locking to mutex torture tests
locktorture: Add nested_[un]lock() hooks and nlocks parameter
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull KCSAN updates from Paul McKenney:
"Kernel concurrency sanitizer (KCSAN) updates for v6.4
This fixes kernel-doc warnings and also updates instrumentation from
READ_ONCE() to volatile in order to avoid unaligned load-acquire
instructions on arm64 in kernels built with LTO"
* tag 'kcsan.2023.04.04a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
kcsan: Avoid READ_ONCE() in read_instrumented_memory()
instrumented.h: Fix all kernel-doc format warnings
|
|
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-04-21
We've added 71 non-merge commits during the last 8 day(s) which contain
a total of 116 files changed, 13397 insertions(+), 8896 deletions(-).
The main changes are:
1) Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward,
from Florian Westphal.
2) Fix race between btf_put and btf_idr walk which caused a deadlock,
from Alexei Starovoitov.
3) Second big batch to migrate test_verifier unit tests into test_progs
for ease of readability and debugging, from Eduard Zingerman.
4) Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree, from Dave Marchevsky.
5) Migrate bpf_for(), bpf_for_each() and bpf_repeat() macros from BPF
selftests into the libbpf-provided bpf_helpers.h header and improve
kfunc handling, from Andrii Nakryiko (a usage sketch follows at the
end of this entry).
6) Support 64-bit pointers to kfuncs needed for archs like s390x,
from Ilya Leoshkevich.
7) Support BPF progs under getsockopt with a NULL optval,
from Stanislav Fomichev.
8) Improve verifier u32 scalar equality checking in order to enable
LLVM transformations which earlier had to be disabled specifically
for BPF backend, from Yonghong Song.
9) Extend bpftool's struct_ops object loading to support links,
from Kui-Feng Lee.
10) Add xsk selftest follow-up fixes for hugepage allocated umem,
from Magnus Karlsson.
11) Support BPF redirects from tc BPF to ifb devices,
from Daniel Borkmann.
12) Add BPF support for integer type when accessing variable length
arrays, from Feng Zhou.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (71 commits)
selftests/bpf: verifier/value_ptr_arith converted to inline assembly
selftests/bpf: verifier/value_illegal_alu converted to inline assembly
selftests/bpf: verifier/unpriv converted to inline assembly
selftests/bpf: verifier/subreg converted to inline assembly
selftests/bpf: verifier/spin_lock converted to inline assembly
selftests/bpf: verifier/sock converted to inline assembly
selftests/bpf: verifier/search_pruning converted to inline assembly
selftests/bpf: verifier/runtime_jit converted to inline assembly
selftests/bpf: verifier/regalloc converted to inline assembly
selftests/bpf: verifier/ref_tracking converted to inline assembly
selftests/bpf: verifier/map_ptr_mixing converted to inline assembly
selftests/bpf: verifier/map_in_map converted to inline assembly
selftests/bpf: verifier/lwt converted to inline assembly
selftests/bpf: verifier/loops1 converted to inline assembly
selftests/bpf: verifier/jeq_infer_not_null converted to inline assembly
selftests/bpf: verifier/direct_packet_access converted to inline assembly
selftests/bpf: verifier/d_path converted to inline assembly
selftests/bpf: verifier/ctx converted to inline assembly
selftests/bpf: verifier/btf_ctx_access converted to inline assembly
selftests/bpf: verifier/bpf_get_stack converted to inline assembly
...
====================
Link: https://lore.kernel.org/r/20230421211035.9111-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
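For item 5 above, a minimal sketch of the bpf_for() macro now shipped in
libbpf's bpf_helpers.h; the section name, the skb-based program and the
bpf_printk() output are illustrative, and the macro requires a kernel that
provides the numeric open-coded iterator kfuncs.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("tc")
int sum_first_ten(struct __sk_buff *skb)
{
        int i, sum = 0;

        bpf_for(i, 0, 10)       /* open-coded iterator over 0..9 */
                sum += i;

        bpf_printk("sum=%d", sum);
        return 0;               /* TC_ACT_OK */
}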
|
|
Patch series "mm: process/cgroup ksm support", v9.
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
Use case 1:
The madvise call is not available in the programming language. An
example of this is programs with forked workloads using a garbage
collected language without pointers. In such a language madvise cannot
be made available.
In addition, the addresses of objects get moved around as they are
garbage collected. KSM sharing needs to be enabled "from the outside"
for these types of workloads.
Use case 2:
The same interpreter can also be used for workloads where KSM brings
no benefit or even has overhead. We'd like to be able to enable KSM on
a workload-by-workload basis.
Use case 3:
With the madvise call, sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs (if
they are part of the same security domain). Only a higher level entity
like a job scheduler or container can know for certain if it is running
one or more instances of a job. That job scheduler, however, doesn't have
the necessary internal workload knowledge to make targeted madvise
calls.
Security concerns:
In previous discussions security concerns have been brought up. The
problem is that an individual workload does not have knowledge about
what else is running on a machine. Therefore it has to be very
conservative about which memory areas can be shared. However, if the
system is dedicated to running multiple jobs within the same security
domain, it's the job scheduler that has the knowledge that sharing can be
safely enabled and is even desirable.
Performance:
Experiments with using UKSM have shown a capacity increase of around 20%.
Here are the metrics from an instagram workload (taken from a machine
with 64GB main memory):
full_scans: 445
general_profit: 20158298048
max_page_sharing: 256
merge_across_nodes: 1
pages_shared: 129547
pages_sharing: 5119146
pages_to_scan: 4000
pages_unshared: 1760924
pages_volatile: 10761341
run: 1
sleep_millisecs: 20
stable_node_chains: 167
stable_node_chains_prune_millisecs: 2000
stable_node_dups: 2751
use_zero_pages: 0
zero_pages_sharing: 0
After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.
Detailed changes:
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one enables KSM at the process level and the second one
queries the setting (a userspace sketch follows at the end of this entry).
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMAs and enable KSM for the eligible VMAs.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
3. Add general_profit metric
The general_profit metric of KSM is specified in the documentation,
but not calculated. This adds the general profit metric to
/sys/kernel/debug/mm/ksm.
4. Add more metrics to ksm_stat
This adds the process profit metric to /proc/<pid>/ksm_stat.
5. Add more tests to ksm_tests and ksm_functional_tests
This adds an option to specify the merge type to the ksm_tests.
This allows testing both madvise and prctl KSM.
It also adds two new tests to ksm_functional_tests: one to test
the new prctl options and the other a fork test to verify that
the KSM process setting is inherited by child processes.
This patch (of 3):
So far KSM can only be enabled by calling madvise for memory regions. To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.
1. New options for prctl system command
This patch series adds two new options to the prctl system call.
The first one enables KSM at the process level and the second one
queries the setting.
The setting will be inherited by child processes.
With the above setting, KSM can be enabled for the seed process of a
cgroup and all processes in the cgroup will inherit the setting.
2. Changes to KSM processing
When KSM is enabled at the process level, the KSM code will iterate
over all the VMAs and enable KSM for the eligible VMAs.
When forking a process that has KSM enabled, the setting will be
inherited by the new child process.
1) Introduce new MMF_VM_MERGE_ANY flag
This introduces the new MMF_VM_MERGE_ANY flag. When this flag
is set, kernel samepage merging (KSM) gets enabled for all VMAs of a
process.
2) Setting VM_MERGEABLE on VMA creation
When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
VM_MERGEABLE flag will be set for this VMA.
3) Support disabling of KSM for a process
This adds the ability to disable KSM for a process if KSM has been
enabled for the process with prctl.
4) Add a new prctl option to get and set KSM for a process
This adds two new options to the prctl system call:
- enable KSM for all VMAs of a process (if the VMAs support it).
- query if KSM has been enabled for a process.
3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
On the s390 architecture, when storage keys are used, the
MMF_VM_MERGE_ANY flag will be disabled.
Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
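A minimal userspace sketch of the prctl interface described in section 1
above; the PR_SET_MEMORY_MERGE / PR_GET_MEMORY_MERGE names and values are
assumptions to verify against <linux/prctl.h> on a kernel carrying this
series.
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_MEMORY_MERGE             /* assumed option names/values */
#define PR_SET_MEMORY_MERGE 67
#define PR_GET_MEMORY_MERGE 68
#endif

int main(void)
{
        /* Enable KSM for all eligible VMAs of this process; children
         * forked afterwards inherit the setting. */
        if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
                perror("PR_SET_MEMORY_MERGE");

        printf("KSM enabled: %d\n", prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
        return 0;
}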
|
|
This adds minimal support for BPF_PROG_TYPE_NETFILTER bpf programs
that will be invoked via the NF_HOOK() points in the ip stack.
Invocation incurs an indirect call. This is not a necessity: it's
possible to add 'DEFINE_BPF_DISPATCHER(nf_progs)' and handle the
program invocation with the same method already done for xdp progs.
This isn't done here to keep the size of this chunk down.
Verifier restricts verdicts to either DROP or ACCEPT.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-3-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
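A bare-bones sketch of what such a BPF_PROG_TYPE_NETFILTER program could
look like. The SEC("netfilter") section name, struct bpf_nf_ctx and libbpf
support for them are assumptions based on this series; the program simply
returns the ACCEPT verdict, which together with DROP is all the verifier
allows.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define NF_ACCEPT 1     /* verdicts restricted to NF_DROP (0) / NF_ACCEPT (1) */

char LICENSE[] SEC("license") = "GPL";

SEC("netfilter")
int nf_accept_all(struct bpf_nf_ctx *ctx)
{
        /* ctx is expected to carry the skb and netfilter hook state. */
        return NF_ACCEPT;
}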
|
|
Add bpf_link support skeleton. To keep this reviewable, no bpf program
can be invoked yet; if a program is attached, only a C stub is called and
not the actual bpf program.
Defaults to 'y' if both netfilter and bpf syscall are enabled in kconfig.
Uapi example usage:
union bpf_attr attr = { };
attr.link_create.prog_fd = progfd;
attr.link_create.attach_type = 0; /* unused */
attr.link_create.netfilter.pf = PF_INET;
attr.link_create.netfilter.hooknum = NF_INET_LOCAL_IN;
attr.link_create.netfilter.priority = -128;
err = bpf(BPF_LINK_CREATE, &attr, sizeof(attr));
... this would attach progfd to ipv4:input hook.
Such a hook gets removed automatically when the calling program exits.
BPF_NETFILTER program invocation is added in followup change.
The NF_HOOK_OP_BPF enum will eventually be read from nfnetlink_hook; it
allows userspace to be told which program is attached at the given hook
when the user runs the 'nft hook list' command, rather than just the
priority and a not-very-helpful 'this hook runs a bpf prog but I can't
tell which one'.
It will also be used to disallow registration of two bpf programs with
the same priority in a followup patch.
v4: arm32 cmpxchg only supports 32bit operand
s/prio/priority/
v3: restrict prog attachment to ip/ip6 for now, lets lift restrictions if
more use cases pop up (arptables, ebtables, netdev ingress/egress etc).
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20230421170300.24115-2-fw@strlen.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Some socket options do getsockopt with optval=NULL to estimate the size
of the final buffer (which is returned via optlen). This breaks BPF
getsockopt assumptions about permitted optval buffer size. Let's enforce
these assumptions only when non-NULL optval is provided.
Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
Reported-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7
Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com
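A userspace sketch of the size-probing pattern this fix keeps working;
NETLINK_LIST_MEMBERSHIPS is used here only as a plausible example of a
socket option that reports the required buffer length via optlen when
optval is NULL.
#include <stdio.h>
#include <sys/socket.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

int main(void)
{
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        socklen_t len = 0;

        if (fd < 0)
                return 1;

        /* optval == NULL: ask the kernel how large the buffer must be. */
        if (getsockopt(fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS, NULL, &len))
                perror("getsockopt");
        else
                printf("buffer size needed: %u bytes\n", (unsigned int)len);
        return 0;
}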
|
|
When calculating the address of the refcount_t struct within a local
kptr, bpf_refcount_acquire_impl should add refcount_off bytes to the
address of the local kptr. Due to some missing parens, the function is
incorrectly adding sizeof(refcount_t) * refcount_off bytes. This patch
fixes the calculation.
Due to the incorrect calculation, bpf_refcount_acquire_impl was trying
to refcount_inc some memory well past the end of local kptrs, resulting
in kasan and refcount complaints, as reported in [0]. In that thread,
Florian and Eduard discovered that bpf selftests written in the new
style - with __success and an expected __retval, specifically - were
not actually being run. As a result, selftests added in bpf_refcount
series weren't really exercising this behavior, and thus didn't unearth
the bug.
With this fixed behavior it's safe to revert commit 7c4b96c00043
("selftests/bpf: disable program test run for progs/refcounted_kptr.c"),
so this patch does so.
[0] https://lore.kernel.org/bpf/ZEEp+j22imoN6rn9@strlen.de/
Fixes: 7c50b1cb76ac ("bpf: Add bpf_refcount_acquire kfunc")
Reported-by: Florian Westphal <fw@strlen.de>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230421074431.3548349-1-davemarchevsky@fb.com
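Not the kernel code, just a stand-alone illustration of the bug class:
adding a byte offset to a typed pointer without first casting to char *
scales the offset by the element size, which is exactly the
sizeof(refcount_t) * refcount_off mistake described above.
#include <stdio.h>
#include <stddef.h>

struct obj {
        long payload[4];
        int refcount;           /* stand-in for the embedded refcount_t */
};

int main(void)
{
        struct obj o = { .refcount = 1 };
        void *kptr = &o;
        size_t refcount_off = offsetof(struct obj, refcount);

        /* Wrong: arithmetic on int * multiplies the byte offset by
         * sizeof(int), pointing well past the end of the object. */
        int *wrong = (int *)kptr + refcount_off;

        /* Right: add the byte offset first, then reinterpret the address. */
        int *right = (int *)((char *)kptr + refcount_off);

        printf("wrong=%p right=%p *right=%d\n",
               (void *)wrong, (void *)right, *right);
        return 0;
}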
|
|
Florian and Eduard reported a hard deadlock:
[ 58.433327] _raw_spin_lock_irqsave+0x40/0x50
[ 58.433334] btf_put+0x43/0x90
[ 58.433338] bpf_find_btf_id+0x157/0x240
[ 58.433353] btf_parse_fields+0x921/0x11c0
This happens since btf->refcount can be 1 at the time of btf_put() and
btf_put() will call btf_free_id() which will try to grab btf_idr_lock
and will deadlock.
Avoid the issue by doing btf_put() without locking.
Fixes: 3d78417b60fb ("bpf: Add bpf_btf_find_by_name_kind() helper.")
Fixes: 1e89106da253 ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().")
Reported-by: Florian Westphal <fw@strlen.de>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20230421014901.70908-1-alexei.starovoitov@gmail.com
|
|
For some unknown reason the introduction of the timer_wait_running callback
failed to fix up posix CPU timers, which went unnoticed for almost four years.
Marco reported recently that the WARN_ON() in timer_wait_running()
triggers with a posix CPU timer test case.
Posix CPU timers have two execution models for expiring timers depending on
CONFIG_POSIX_CPU_TIMERS_TASK_WORK:
1) If not enabled, the expiry happens in hard interrupt context so
spin waiting on the remote CPU is reasonably time bound.
Implement an empty stub function for that case.
2) If enabled, the expiry happens in task work before returning to user
space or guest mode. The expired timers are marked as firing and moved
from the timer queue to a local list head with sighand lock held. Once
the timers are moved, sighand lock is dropped and the expiry happens in
fully preemptible context. That means the expiring task can be scheduled
out, migrated, interrupted etc. So spin waiting on it is more than
suboptimal.
The timer wheel has a timer_wait_running() mechanism for RT, which uses
a per CPU timer-base expiry lock which is held by the expiry code and the
task waiting for the timer function to complete blocks on that lock.
This does not work in the same way for posix CPU timers as there is no
timer base and expiry for process wide timers can run on any task
belonging to that process, but the concept of waiting on an expiry lock
can be used too in a slightly different way:
- Add a mutex to struct posix_cputimers_work. This struct is per task
and used to schedule the expiry task work from the timer interrupt.
- Add a task_struct pointer to struct cpu_timer which is used to store
the task which runs the expiry. That's filled in when the task
moves the expired timers to the local expiry list. That does not
affect the size of the k_itimer union as there are bigger union
members already.
- Let the task take the expiry mutex around the expiry function
- Let the waiter acquire a task reference with rcu_read_lock() held and
block on the expiry mutex
This avoids spin-waiting on a task which might not even be on a CPU and
works nicely for RT too.
Fixes: ec8f954a40da ("posix-timers: Use a callback for cancel synchronization on PREEMPT_RT")
Reported-by: Marco Elver <elver@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marco Elver <elver@google.com>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/87zg764ojw.ffs@tglx
|
|
Have local_clock() return sched_clock() if sched_clock_init() has not
yet run. sched_clock_cpu() has this check but it was not included in the
new noinstr implementation of local_clock().
The effect can be seen on x86 with CONFIG_PRINTK_TIME enabled, for
instance. scd->clock quickly reaches the value of TICK_NSEC and that
value is returned until sched_clock_init() runs.
dmesg without this patch:
[ 0.000000] kvm-clock: ...
[ 0.000002] kvm-clock: ...
[ 0.000672] clocksource: ...
[ 0.001000] tsc: ...
[ 0.001000] e820: ...
[ 0.001000] e820: ...
...
[ 0.001000] ..TIMER: ...
[ 0.001000] clocksource: ...
[ 0.378956] Calibrating delay loop ...
[ 0.379955] pid_max: ...
dmesg with this patch:
[ 0.000000] kvm-clock: ...
[ 0.000001] kvm-clock: ...
[ 0.000675] clocksource: ...
[ 0.002685] tsc: ...
[ 0.003331] e820: ...
[ 0.004190] e820: ...
...
[ 0.421939] ..TIMER: ...
[ 0.422842] clocksource: ...
[ 0.424582] Calibrating delay loop ...
[ 0.425580] pid_max: ...
Fixes: 776f22913b8e ("sched/clock: Make local_clock() noinstr")
Signed-off-by: Aaron Thompson <dev@aaront.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230413175012.2201-1-dev@aaront.org
|
|
Commit 95158a89dd50 ("sched,rt: Use the full cpumask for balancing")
allows find_lock_lowest_rq() to pick a task with migration disabled.
The purpose of the commit is to push the currently running task on the
CPU that has the migrate_disable() task away.
However, there is a race which allows a migrate_disable() task to be
migrated. Consider:
CPU0                                    CPU1
push_rt_task
  check is_migration_disabled(next_task)
                                        task not running and
                                        migration_disabled == 0
  find_lock_lowest_rq(next_task, rq);
    _double_lock_balance(this_rq, busiest);
      raw_spin_rq_unlock(this_rq);
      double_rq_lock(this_rq, busiest);
        <<wait for busiest rq>>
                                        <wakeup>
                                        task become running
                                        migrate_disable();
                                        <context out>
  deactivate_task(rq, next_task, 0);
  set_task_cpu(next_task, lowest_rq->cpu);
  WARN_ON_ONCE(is_migration_disabled(p));
Fixes: 95158a89dd50 ("sched,rt: Use the full cpumask for balancing")
Signed-off-by: Schspa Shi <schspa@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Tested-by: Dwaine Gonyier <dgonyier@redhat.com>
|
|
Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.
Keep track of the currently allocated mm_cid for each mm/cpu rather than
freeing them immediately on context switch. This eliminates most atomic
operations when context switching back and forth between threads
belonging to different memory spaces in multi-threaded scenarios (many
processes, each with many threads). The per-mm/per-cpu mm_cid values are
serialized by their respective runqueue locks.
Thread migration is handled by introducing a call to
sched_mm_cid_migrate_to() (with destination runqueue lock held) in
activate_task() for migrating tasks. If the destination cpu's mm_cid is
unset, and if the source runqueue is not actively using its mm_cid, then
the source cpu's mm_cid is moved to the destination cpu on migration.
Introduce a task-work executed periodically, similarly to NUMA work,
which delays reclaim of cid values when they are unused for a period of
time.
Keep track of the allocation time for each per-cpu cid, and let the task
work clear them when they are observed to be older than
SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
mm_cids which are greater or equal to the Hamming weight of the mm
cidmask to keep concurrency ids compact.
Because we want to ensure the mm_cid converges towards the smaller
values as migrations happen, the prior optimization that was done when
context switching between threads belonging to the same mm is removed,
because it could delay the lazy release of the destination runqueue
mm_cid after it has been replaced by a migration. Removing this prior
optimization is not an issue performance-wise because the introduced
per-mm/per-cpu mm_cid tracking also covers this more specific case.
Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Reported-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
|
|
Sync with the urgent patches; in particular:
a53ce18cacb4 ("sched/fair: Sanitize vruntime of entity being migrated")
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Adjacent changes:
net/mptcp/protocol.h
63740448a32e ("mptcp: fix accept vs worker race")
2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close")
ddb1a072f858 ("mptcp: move first subflow allocation at mpc access time")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and bpf.
There are a few fixes for new code bugs, including the Mellanox one
noted in the last networking pull. No known regressions outstanding.
Current release - regressions:
- sched: clear actions pointer in miss cookie init fail
- mptcp: fix accept vs worker race
- bpf: fix bpf_arch_text_poke() with new_addr == NULL on s390
- eth: bnxt_en: fix a possible NULL pointer dereference in unload
path
- eth: veth: take into account peer device for
NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
Current release - new code bugs:
- eth: revert "net/mlx5: Enable management PF initialization"
Previous releases - regressions:
- netfilter: fix recent physdev match breakage
- bpf: fix incorrect verifier pruning due to missing register
precision taints
- eth: virtio_net: fix overflow inside xdp_linearize_page()
- eth: cxgb4: fix use after free bugs caused by circular dependency
problem
- eth: mlxsw: pci: fix possible crash during initialization
Previous releases - always broken:
- sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg
- netfilter: validate catch-all set elements
- bridge: don't notify FDB entries with "master dynamic"
- eth: bonding: fix memory leak when changing bond type to ethernet
- eth: i40e: fix accessing vsi->active_filters without holding lock
Misc:
- Mat is back as MPTCP co-maintainer"
* tag 'net-6.3-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits)
net: bridge: switchdev: don't notify FDB entries with "master dynamic"
Revert "net/mlx5: Enable management PF initialization"
MAINTAINERS: Resume MPTCP co-maintainer role
mailmap: add entries for Mat Martineau
e1000e: Disable TSO on i219-LM card to increase speed
bnxt_en: fix free-runnig PHC mode
net: dsa: microchip: ksz8795: Correctly handle huge frame configuration
bpf: Fix incorrect verifier pruning due to missing register precision taints
hamradio: drop ISA_DMA_API dependency
mlxsw: pci: Fix possible crash during initialization
mptcp: fix accept vs worker race
mptcp: stops worker on unaccepted sockets at listener close
net: rpl: fix rpl header size calculation
net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete()
bonding: Fix memory leak when changing bond type to Ethernet
veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
mlxfw: fix null-ptr-deref in mlxfw_mfa2_tlv_next()
bnxt_en: Fix a possible NULL pointer dereference in unload path
bnxt_en: Do not initialize PTP on older P3/P4 chips
netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
...
|
|
Userspace can't easily discover how much of a sleep cycle was spent in a
hardware sleep state without using kernel tracing and vendor specific sysfs
or debugfs files.
To make this information more discoverable, introduce 3 new sysfs files:
1) The time spent in a hw sleep state for the last cycle.
2) The time spent in a hw sleep state since the kernel booted.
3) The maximum time that the hardware can report for a sleep cycle.
All of these files will be present only if the system supports s2idle.
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
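A small userspace sketch for reading the new attributes; the file names and
their location under /sys/power/suspend_stats/ are assumptions inferred from
this description and should be verified on a kernel that carries the change.
#include <stdio.h>

int main(void)
{
        /* Assumed paths for the three new attributes. */
        const char *files[] = {
                "/sys/power/suspend_stats/last_hw_sleep",
                "/sys/power/suspend_stats/total_hw_sleep",
                "/sys/power/suspend_stats/max_hw_sleep",
        };
        char buf[64];

        for (int i = 0; i < 3; i++) {
                FILE *f = fopen(files[i], "r");

                if (f && fgets(buf, sizeof(buf), f))
                        printf("%s: %s", files[i], buf);
                if (f)
                        fclose(f);
        }
        return 0;
}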
|
|
The tracking of used_hiwater adds an atomic operation to the hot
path. This is acceptable only when debugging the kernel. To make
sure that the fields can never be used by mistake, do not even
include them in struct io_tlb_mem if CONFIG_DEBUG_FS is not set.
The build fails after doing that. To fix it, it is necessary to
remove all code specific to debugfs and instead provide a stub
implementation of swiotlb_create_debugfs_files(). As a bonus, this
change allows removing one __maybe_unused attribute.
Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
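The usual kernel idiom for the stub approach described above, shown
generically; the actual swiotlb function signature and placement should be
taken from kernel/dma/swiotlb.c rather than from this sketch.
/* Compile out debugfs-only code behind a stub instead of sprinkling
 * #ifdef CONFIG_DEBUG_FS through the callers. */
#ifdef CONFIG_DEBUG_FS
static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
                                         const char *dirname)
{
        /* ... create the debugfs directory and counters ... */
}
#else
static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,
                                                const char *dirname)
{
}
#endif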
|
|
In the old days when each device had a custom kernel, the
android config fragments were useful to provide the required
and recommended options expected by userland.
However, these days devices are expected to use the GKI kernel,
so these config fragments are no longer needed and are out of
date, and seem to only cause confusion.
So let's drop them. If folks are curious which configs are
expected by the Android environment, check out the gki_defconfig
file in the latest android common kernel tree.
Cc: Rob Herring <robh@kernel.org>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: <kernel-team@android.com>
Signed-off-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20230411180409.1706067-1-jstultz@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Factor out the code that fills the stack with the stackleak poison value
in order to allow architectures to provide a faster implementation.
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20230405130841.1350565-2-hca@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
|
|
After commit 9c5f8a1008a1 ("bpf: Support variable length array in
tracing programs"), trace programs can access a variable length array,
but only when its element is a structure type. This patch adds support
for the integer type.
Example:
Hook load_balance
struct sched_domain {
...
unsigned long span[];
}
The access: sd->span[0].
Co-developed-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Link: https://lore.kernel.org/r/20230420032735.27760-2-zhoufeng.zf@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"22 hotfixes.
19 are cc:stable and the remainder address issues which were
introduced during this merge cycle, or aren't considered suitable for
-stable backporting.
19 are for MM and the remainder are for other subsystems"
* tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
nilfs2: initialize unused bytes in segment summary blocks
mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
mm/mmap: regression fix for unmapped_area{_topdown}
maple_tree: fix mas_empty_area() search
maple_tree: make maple state reusable after mas_empty_area_rev()
mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
tools/Makefile: do missed s/vm/mm/
mm: fix memory leak on mm_init error handling
mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Revert "userfaultfd: don't fail on unrecognized features"
writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug
tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used
mailmap: update jtoppins' entry to reference correct email
mm/mempolicy: fix use-after-free of VMA iterator
mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
mm/mprotect: fix do_mprotect_pkey() return on error
mm/khugepaged: check again on anon uffd-wp during isolation
...
|
|
The finit_module() system call can in the worst case use up to more than
twice a module's size in virtual memory. Duplicate finit_module()
system calls are non-fatal; however, they unnecessarily strain virtual
memory during bootup and in the worst case can cause a system to fail
to boot. This is currently only known to be an issue on systems with a
larger number of CPUs.
To help debug this situation we need to consider the different sources of
finit_module() calls. Requests from the kernel that rely on module
auto-loading, i.e., the kernel's *request_module() API, are one source.
Although modprobe checks whether a module is already loaded prior to
calling finit_module(), there is a small race window that allows userspace
to trigger multiple modprobe calls racing against each other, with none of
them yet seeing the module as loaded.
This adds debugging support to the kernel module auto-loader (*request_module()
calls) to easily detect duplicate module requests. To aid with possible bootup
failure issues incurred by this, it will converge duplicate requests into a
single request. This avoids any possible strain on virtual memory during
bootup which could be incurred by duplicate module autoloading requests.
Folks debugging virtual memory abuse on bootup can and should enable
this to see which pr_warn()s show up and whether module auto-loading is to
blame for their woes. If they see duplicates they can further debug this
by enabling the module.enable_dups_trace kernel parameter or by enabling
CONFIG_MODULE_DEBUG_AUTOLOAD_DUPS_TRACE.
Current evidence seems to point to only a few duplicates coming from module
auto-loading, so the source of the other duplicates creating heavy
virtual memory pressure on systems with a larger number of CPUs must be
coming from another place (likely udev).
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Juan Jose et al reported an issue found via fuzzing where the verifier's
pruning logic prematurely marks a program path as safe.
Consider the following program:
0: (b7) r6 = 1024
1: (b7) r7 = 0
2: (b7) r8 = 0
3: (b7) r9 = -2147483648
4: (97) r6 %= 1025
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2
7: (97) r6 %= 1
8: (b7) r9 = 0
9: (bd) if r6 <= r9 goto pc+1
10: (b7) r6 = 0
11: (b7) r0 = 0
12: (63) *(u32 *)(r10 -4) = r0
13: (18) r4 = 0xffff888103693400 // map_ptr(ks=4,vs=48)
15: (bf) r1 = r4
16: (bf) r2 = r10
17: (07) r2 += -4
18: (85) call bpf_map_lookup_elem#1
19: (55) if r0 != 0x0 goto pc+1
20: (95) exit
21: (77) r6 >>= 10
22: (27) r6 *= 8192
23: (bf) r1 = r0
24: (0f) r0 += r6
25: (79) r3 = *(u64 *)(r0 +0)
26: (7b) *(u64 *)(r1 +0) = r3
27: (95) exit
The verifier treats this as safe, leading to oob read/write access due
to an incorrect verifier conclusion:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b7) r6 = 1024 ; R6_w=1024
1: (b7) r7 = 0 ; R7_w=0
2: (b7) r8 = 0 ; R8_w=0
3: (b7) r9 = -2147483648 ; R9_w=-2147483648
4: (97) r6 %= 1025 ; R6_w=scalar()
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff00000000; 0xffffffff)) R9_w=-2147483648
7: (97) r6 %= 1 ; R6_w=scalar()
8: (b7) r9 = 0 ; R9=0
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
10: (b7) r6 = 0 ; R6_w=0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 9
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=0
22: (27) r6 *= 8192 ; R6_w=0
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 19
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
last_idx 18 first_idx 9
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
regs=40 stack=0 before 10: (b7) r6 = 0
25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
27: (95) exit
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1
frame 0: propagating r6
last_idx 19 first_idx 11
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
last_idx 9 first_idx 9
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=0 R10=fp0
last_idx 8 first_idx 0
regs=40 stack=0 before 8: (b7) r9 = 0
regs=40 stack=0 before 7: (97) r6 %= 1
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=40 stack=0 before 5: (05) goto pc+0
regs=40 stack=0 before 4: (97) r6 %= 1025
regs=40 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
19: safe
frame 0: propagating r6
last_idx 9 first_idx 0
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=40 stack=0 before 5: (05) goto pc+0
regs=40 stack=0 before 4: (97) r6 %= 1025
regs=40 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
from 6 to 9: safe
verification time 110 usec
stack depth 4
processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2
The verifier considers this program as safe by mistakenly pruning unsafe
code paths. In the above func#0, code lines 0-10 are of interest. In lines
0-3, registers r6 to r9 are initialized with known scalar values. In line 4,
register r6 is reset to an unknown scalar, given the verifier does not
track modulo operations. Due to this, the verifier also cannot determine
precisely which branches in lines 6 and 9 are taken, therefore it needs to
explore them both.
As can be seen, the verifier starts with exploring the false/fall-through
paths first. The 'from 19 to 21' path has both r6=0 and r9=0 and the pointer
arithmetic on r0 += r6 is therefore considered safe. Given the arithmetic,
r6 is correctly marked for precision tracking, at which point backtracking
kicks in and walks back the current path all the way to where r6 was set
to 0 in the fall-through branch.
Next, the pruning logic pops the path 'from 9 to 11' from the stack. Also
here, the state of the registers is the same, that is, r6=0 and r9=0, so
that at line 19 the path can be pruned as it is considered safe. It is
interesting to note that the conditional in line 9 turned r6 into a more
precise state, that is, in the fall-through path at the beginning of line
10, it is R6=scalar(umin=1), and in the branch-taken path (which is analyzed
here) at the beginning of line 11, r6 turned into a known const r6=0 as
r9=0 prior to that and therefore (unsigned) r6 <= 0 concludes that r6 must
be 0 (**):
[...] ; R6_w=scalar()
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
[...]
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
[...]
The next path is 'from 6 to 9'. The verifier considers the old and current
state equivalent, and therefore prunes the search incorrectly. Looking into
the two states which are being compared by the pruning logic at line 9, the
old state consists of R6_rwD=Pscalar() R9_rwD=0 R10=fp0 and the new state
consists of R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968)
R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0. While r6 had the reg->precise flag
correctly set in the old state, r9 did not. Both r6 states are considered
equivalent given the old one is a superset of the current, more precise one;
however, r9's actual values (0 vs 0x80000000) mismatch.
did not have reg->precise flag set, the verifier does not consider the
register as contributing to the precision state of r6, and therefore it
considered both r9 states as equivalent. However, for this specific pruned
path (which is also the actual path taken at runtime), register r6 will be
0x400 and r9 0x80000000 when reaching line 21, thus oob-accessing the map.
The purpose of precision tracking is to initially mark registers (including
spilled ones) as imprecise to help verifier's pruning logic finding equivalent
states it can then prune if they don't contribute to the program's safety
aspects. For example, if registers are used for pointer arithmetic or to pass
constant length to a helper, then the verifier sets reg->precise flag and
backtracks the BPF program instruction sequence and chain of verifier states
to ensure that the given register or stack slot including their dependencies
are marked as precisely tracked scalar. This also includes any other registers
and slots that contribute to a tracked state of given registers/stack slot.
This backtracking relies on recorded jmp_history and is able to traverse
entire chain of parent states. This process ends only when all the necessary
registers/slots and their transitive dependencies are marked as precise.
The backtrack_insn() is called from the current instruction up to the first
instruction, and its purpose is to compute a bitmask of registers and stack
slots that need precision tracking in the parent's verifier state. For example,
if a current instruction is r6 = r7, then r6 needs precision after this
instruction and r7 needs precision before this instruction, that is, in the
parent state. Hence for the latter r7 is marked and r6 unmarked.
For the class of jmp/jmp32 instructions, backtrack_insn() today only looks
at call and exit instructions and for all other conditionals the masks
remain as-is. However, in the given situation register r6 has a dependency
on r9 (as described above in **), so also that one needs to be marked for
precision tracking. In other words, if an imprecise register influences a
precise one, then the imprecise register should also be marked precise.
Meaning, in the parent state both dest and src register need to be tracked
for precision and therefore the marking must be more conservative by setting
reg->precise flag for both. The precision propagation needs to cover both
for the conditional: if the src reg was marked but not the dst reg and vice
versa.
After the fix the program is correctly rejected:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b7) r6 = 1024 ; R6_w=1024
1: (b7) r7 = 0 ; R7_w=0
2: (b7) r8 = 0 ; R8_w=0
3: (b7) r9 = -2147483648 ; R9_w=-2147483648
4: (97) r6 %= 1025 ; R6_w=scalar()
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff80000000; 0x7fffffff),u32_min=-2147483648) R9_w=-2147483648
7: (97) r6 %= 1 ; R6_w=scalar()
8: (b7) r9 = 0 ; R9=0
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
10: (b7) r6 = 0 ; R6_w=0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 9
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=0
22: (27) r6 *= 8192 ; R6_w=0
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 19
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
last_idx 18 first_idx 9
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
regs=40 stack=0 before 10: (b7) r6 = 0
25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
27: (95) exit
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1
frame 0: propagating r6
last_idx 19 first_idx 11
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
last_idx 9 first_idx 9
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
parent didn't have regs=240 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=P0 R10=fp0
last_idx 8 first_idx 0
regs=240 stack=0 before 8: (b7) r9 = 0
regs=40 stack=0 before 7: (97) r6 %= 1
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
19: safe
from 6 to 9: R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
9: (bd) if r6 <= r9 goto pc+1
last_idx 9 first_idx 0
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
last_idx 9 first_idx 0
regs=200 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
11: R6=scalar(umax=18446744071562067968) R9=-2147483648
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=3,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0_w=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=scalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=scalar(umax=18014398507384832,var_off=(0x0; 0x3fffffffffffff))
22: (27) r6 *= 8192 ; R6_w=scalar(smax=9223372036854767616,umax=18446744073709543424,var_off=(0x0; 0xffffffffffffe000),s32_max=2147475456,u32_max=-8192)
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 21
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
parent didn't have regs=40 stack=0 marks: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_r=Pscalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
last_idx 19 first_idx 11
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
last_idx 9 first_idx 0
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
regs=240 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
math between map_value pointer and register with unbounded min value is not allowed
verification time 886 usec
stack depth 4
processed 49 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 2
Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking")
Reported-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com>
Reported-by: Meador Inge <meadori@google.com>
Reported-by: Simon Scannell <simonscannell@google.com>
Reported-by: Nenad Stojanovski <thenenadx@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Co-developed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Juan Jose Lopez Jaimez <jjlopezjaimez@google.com>
Reviewed-by: Meador Inge <meadori@google.com>
Reviewed-by: Simon Scannell <simonscannell@google.com>
|
|
Delay accounting does not track the delay caused by IRQ/SOFTIRQ, even
though IRQ/SOFTIRQ can have an obvious impact on the productivity of some
workloads, such as when workloads run on a system which is busy handling
network IRQ/SOFTIRQ.
Tracking the IRQ/SOFTIRQ delay can help users reduce such delay, for
example by setting interrupt affinity or task affinity, or by using a
kernel thread for NAPI. This is inspired by "sched/psi: Add PSI_IRQ to
track IRQ/SOFTIRQ pressure"[1]. Also fix some code indentation problems
in older code.
And update tools/accounting/getdelays.c:
/ # ./getdelays -p 156 -di
print delayacct stats ON
printing IO accounting
PID 156
CPU count real total virtual total delay total delay average
15 15836008 16218149 275700790 18.380ms
IO count delay total delay average
0 0 0.000ms
SWAP count delay total delay average
0 0 0.000ms
RECLAIM count delay total delay average
0 0 0.000ms
THRASHING count delay total delay average
0 0 0.000ms
COMPACT count delay total delay average
0 0 0.000ms
WPCOPY count delay total delay average
36 7586118 0.211ms
IRQ count delay total delay average
42 929161 0.022ms
[1] commit 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure")
Link: https://lkml.kernel.org/r/202304081728353557233@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Cc: wangyong <wang.yong12@zte.com.cn>
Cc: junhua huang <huang.junhua@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The console tracepoint is used by kcsan/kasan/kfence/kmsan test modules.
Since this tracepoint is not exported, these modules iterate over all
available tracepoints to find the console trace point. Export the trace
point so that it can be directly used.
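For illustration, a minimal sketch of what direct use from a test module
could look like once the tracepoint is exported; the probe signature
follows the console trace event (a text buffer and its length), and the
demo module itself is hypothetical:

#include <linux/module.h>
#include <trace/events/printk.h>

static unsigned long bytes_seen;

/* Probe matching the console trace event: (const char *text, size_t len). */
static void probe_console(void *ignore, const char *text, size_t len)
{
	/* Don't printk from here; the probe runs inside the console path. */
	bytes_seen += len;
}

static int __init demo_init(void)
{
	/* Works directly now that the tracepoint is exported to modules. */
	return register_trace_console(probe_console, NULL);
}

static void __exit demo_exit(void)
{
	unregister_trace_console(probe_console, NULL);
	tracepoint_synchronize_unregister();
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");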
Link: https://lkml.kernel.org/r/20230413100859.1492323-1-quic_pkondeti@quicinc.com
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If a library wants to get information from auxv (for instance,
AT_HWCAP/AT_HWCAP2), it has a few options, none of them perfectly reliable
or ideal:
- Be main or the pre-main startup code, and grub through the stack above
main. Doesn't work for a library.
- Call libc getauxval. Not ideal for libraries that are trying to be
libc-independent and/or don't otherwise require anything from other
libraries.
- Open and read /proc/self/auxv. Doesn't work for libraries that may run
in arbitrarily constrained environments that may not have /proc
mounted (e.g. libraries that might be used by an init program or a
container setup tool).
- Assume you're on the main thread and still on the original stack, and
try to walk the stack upwards, hoping to find auxv. Extremely bad
idea.
- Ask the caller to pass auxv in for you. Not ideal for a user-friendly
library, and then your caller may have the same problem.
Add a prctl that copies current->mm->saved_auxv to a userspace buffer.
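For illustration, a minimal userspace sketch of how a library might
consume this; the PR_GET_AUXV name and value reflect this proposal and
are defined locally here as an assumption, in case system headers do not
yet provide them:

#include <stdio.h>
#include <elf.h>          /* AT_HWCAP, AT_NULL */
#include <sys/prctl.h>

#ifndef PR_GET_AUXV
#define PR_GET_AUXV 0x41555856    /* "AUXV"; assumed value from this proposal */
#endif

int main(void)
{
	unsigned long auxv[128] = { 0 };   /* key/value pairs, AT_NULL terminated */

	if (prctl(PR_GET_AUXV, auxv, sizeof(auxv), 0, 0) < 0) {
		perror("PR_GET_AUXV");
		return 1;
	}

	for (int i = 0; i + 1 < 128 && auxv[i] != AT_NULL; i += 2) {
		if (auxv[i] == AT_HWCAP)
			printf("AT_HWCAP = %#lx\n", auxv[i + 1]);
	}
	return 0;
}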
Link: https://lkml.kernel.org/r/d81864a7f7f43bca6afa2a09fc2e850e4050ab42.1680611394.git.josh@joshtriplett.org
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "memcg: avoid flushing stats atomically where possible", v3.
rstat flushing is an expensive operation that scales with the number of
cpus and the number of cgroups in the system. The purpose of this series
is to minimize the contexts where we flush stats atomically.
Patches 1 and 2 are cleanups requested during reviews of prior versions of
this series.
Patch 3 makes sure we never try to flush from within an irq context.
Patches 4 to 7 introduce separate variants of mem_cgroup_flush_stats() for
atomic and non-atomic flushing, and make sure we only flush the stats
atomically when necessary.
Patch 8 is a slightly tangential optimization that limits the work done by
rstat flushing in some scenarios.
This patch (of 8):
cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
"irqs are disabled throughout", which is what the current implementation
does (currently under discussion [1]), but is not the intention. The
intention is that this function is safe to call from atomic contexts.
Name it as such.
Link: https://lkml.kernel.org/r/20230330191801.1967435-1-yosryahmed@google.com
Link: https://lkml.kernel.org/r/20230330191801.1967435-2-yosryahmed@google.com
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vasily Averin <vasily.averin@linux.dev>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
introduces a memory leak by missing a call to destroy_context() when a
percpu_counter fails to allocate.
Before introducing the per-cpu counter allocations, init_new_context() was
the last call that could fail in mm_init(), and thus there was no need to
ever invoke destroy_context() in the error paths. Adding the following
percpu counter allocations adds error paths after init_new_context(),
which means its associated destroy_context() needs to be called when
percpu counters fail to allocate.
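For illustration, a self-contained sketch of the unwind-ordering pattern
at play here; the helper names below are placeholders standing in for
init_new_context()/destroy_context() and the percpu counter allocation,
not the actual mm_init() code:

#include <errno.h>
#include <stdbool.h>

struct obj { bool have_context; bool have_counters; };

static int  ctx_init(struct obj *o)       { o->have_context = true;  return 0; }
static void ctx_destroy(struct obj *o)    { o->have_context = false; }
static int  counters_alloc(struct obj *o) { (void)o; return -ENOMEM; /* simulated failure */ }

static int obj_init(struct obj *o)
{
	int err;

	err = ctx_init(o);           /* previously the last step that could fail */
	if (err)
		goto fail_nocontext;

	err = counters_alloc(o);     /* new fallible step added after ctx_init() ... */
	if (err)
		goto fail_pcpu;      /* ... so its error path must also undo the context */

	return 0;

fail_pcpu:
	ctx_destroy(o);              /* this undo is the call that was missing */
fail_nocontext:
	return err;
}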
Link: https://lkml.kernel.org/r/20230330133822.66271-1-mathieu.desnoyers@efficios.com
Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.
The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not. This means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if it is doing an operation
for which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false; a simplified
sketch of the resulting check order follows below.
While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
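For illustration, a self-contained schematic of the resulting order; the
types and helpers are placeholders (and the "leave unchanged" -1
convention is omitted), not the actual __sys_setresuid() code:

#include <errno.h>
#include <stdbool.h>

typedef unsigned int xuid_t;

struct ids { xuid_t ruid, euid, suid; };

/* Placeholder for ns_capable_setid(): the LSM-audited capability check. */
static bool capable_setid(void) { return false; }

static bool already_held(const struct ids *old, xuid_t id)
{
	return id == old->ruid || id == old->euid || id == old->suid;
}

static int do_setresuid(const struct ids *old, xuid_t ruid, xuid_t euid, xuid_t suid)
{
	/* Bail out early on a no-op: no cred allocation, no audit record. */
	if (ruid == old->ruid && euid == old->euid && suid == old->suid)
		return 0;

	/* The audited capability check now runs last, and only when the
	 * request asks for an id the caller does not already hold. */
	if ((!already_held(old, ruid) || !already_held(old, euid) ||
	     !already_held(old, suid)) && !capable_setid())
		return -EPERM;

	/* ... prepare_creds() and the actual id update would follow ... */
	return 0;
}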
Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This was caught by randconfig builds but does not show up in
build testing without CONFIG_MODULE_DECOMPRESS:
kernel/module/stats.c: In function 'mod_stat_bump_invalid':
kernel/module/stats.c:229:42: error: 'invalid_mod_byte' undeclared (first use in this function); did you mean 'invalid_mod_bytes'?
229 | atomic_long_add(info->compressed_len, &invalid_mod_byte);
| ^~~~~~~~~~~~~~~~
| invalid_mod_bytes
Fixes: df3e764d8e5c ("module: add debug stats to help identify memory pressure")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
clang build reports
kernel/module/stats.c:307:34: error: variable
'len' is uninitialized when used here [-Werror,-Wuninitialized]
len = scnprintf(buf + 0, size - len,
^~~
At the start of this sequence, neither the '+ 0' nor the '- len' is
needed. So remove them and fix the use of the uninitialized 'len'.
Fixes: df3e764d8e5c ("module: add debug stats to help identify memory pressure")
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
The new module statistics code mixes 64-bit types and word-sized 'long'
variables, which leads to build failures on 32-bit architectures:
kernel/module/stats.c: In function 'read_file_mod_stats':
kernel/module/stats.c:291:29: error: passing argument 1 of 'atomic64_read' from incompatible pointer type [-Werror=incompatible-pointer-types]
291 | total_size = atomic64_read(&total_mod_size);
x86_64-linux-ld: kernel/module/stats.o: in function `read_file_mod_stats':
stats.c:(.text+0x2b2): undefined reference to `__udivdi3'
To fix this, the code has to use one of the two types consistently.
Change them all to word-size types here.
Fixes: df3e764d8e5c ("module: add debug stats to help identify memory pressure")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
MODULE_INIT_COMPRESSED_FILE is defined in the uapi header, which
is not included indirectly from the normal linux/module.h, but
has to be pulled in explicitly:
kernel/module/stats.c: In function 'mod_stat_bump_invalid':
kernel/module/stats.c:227:14: error: 'MODULE_INIT_COMPRESSED_FILE' undeclared (first use in this function)
227 | if (flags & MODULE_INIT_COMPRESSED_FILE)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
The finit_module() system call can create unnecessary virtual memory
pressure for duplicate modules. This is because load_module() can in
the worst case allocate more than twice the size of a module in virtual
memory. This saves at least a full module's size of wasted vmalloc
space by trying to avoid duplicates as soon as we can validate the
module name in the module structure we have just read.
This can only be an issue if a system is getting hammered with userspace
loading modules. There are typically two ways modules get loaded on a
system: kernel module auto-loading (*request_module*() calls in-kernel)
and things like udev. The auto-loading starts in-kernel, but it pings
back to userspace to just call modprobe. We already have a way to
restrict the number of concurrent kernel auto-loads in a given time
window, however that still allows multiple requests for the same module
to go through and forces two userspace threads to race to call modprobe
for the exact same module. Even though libkmod, which both modprobe and
udev use, does check whether a module is already loaded prior to calling
finit_module(), races are still possible, and this is clearly evident
today when you have multiple CPUs.
To avoid memory pressure for such stupid cases, put a stopgap in place
for them. The *earliest* we can detect duplicates from the module side of
things is once we have blessed the module name, sadly after the first
vmalloc allocation. We can however check for the module being present
*before* a secondary vmalloc() allocation.
There is a linear relationship between wasted virtual memory bytes and
the CPU count. The reason is that udev ends up racing to load tons of
the same modules for each of the CPUs.
We can see the different linear relationships between wasted virtual
memory and CPU count after boot in the following graph:
+----------------------------------------------------------------------------+
14GB |-+ + + + + *+ +-|
| **** |
| *** |
| ** |
12GB |-+ ** +-|
| ** |
| ** |
| ** |
| ** |
10GB |-+ ** +-|
| ** |
| ** |
| ** |
8GB |-+ ** +-|
waste | ** ### |
| ** #### |
| ** ####### |
6GB |-+ **** #### +-|
| * #### |
| * #### |
| ***** #### |
4GB |-+ ** #### +-|
| ** #### |
| ** #### |
| ** #### |
2GB |-+ ** ##### +-|
| * #### |
| * #### Before ******* |
| **## + + + + After ####### |
+----------------------------------------------------------------------------+
0 50 100 150 200 250 300
CPUs count
On the y-axis we can see gigabytes of wasted virtual memory during boot
due to duplicate module requests which just end up failing. Inferring
the slope, this comes out to about ~463 MiB per CPU lost prior to this
patch. After this patch we only lose about ~230 MiB per CPU, for a
savings of about ~233 MiB per CPU. This is all *just on bootup*!
On an 8 vCPU, 8 GiB RAM system using kdevops and testing against the
selftests kmod.sh -t 0008 I see the *highest* memory consumption reduced
by up to ~84 MiB with the Linux kernel selftests kmod test 0008. With
the new stress-ng module test I see a 145 MiB difference in max memory
consumption with 100 ops. The stress-ng module ops tests can be pretty
pathological -- they are not realistic, however they were used to
finally successfully reproduce issues which are only reported to happen
on systems with over 400 CPUs [0] by using just 100 ops on an 8 vCPU,
8 GiB RAM system. Running out of virtual memory space is no surprise
given the above graph, since at least on x86_64 we're capped at 128 MiB;
eventually we'd hit a series of errors, and one can use the above graph
to guesstimate when. This of course will vary depending on the features
you have enabled. So for instance, enabling KASAN seems to make this
much worse.
The results with kmod and stress-ng can be observed and visualized below.
The time it takes to run the test is also not affected.
The kmod test 0008:
The gnuplot y-range is set from 400000 KiB (390 MiB) to 580000 KiB
(566 MiB) given the tests peak around that range.
cat kmod.plot
set term dumb
set output fileout
set yrange [400000:580000]
plot filein with linespoints title "Memory usage (KiB)"
Before:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C
root@kmod ~ # sort -n -r log-0008-before.txt | head -1
528732
So ~516.33 MiB
After:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C
root@kmod ~ # sort -n -r log-0008-after.txt | head -1
442516
So ~432.14 MiB
That's about ~84 MiB in savings in the worst case. The graphs:
root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot
root@kmod ~ # gnuplot -e "filein='log-0008-after.txt'; fileout='graph-0008-after.txt'" kmod.plot
root@kmod ~ # cat graph-0008-before.txt
580000 +-----------------------------------------------------------------+
| + + + + + + + |
560000 |-+ Memory usage (KiB) ***A***-|
| |
540000 |-+ +-|
| |
| *A *AA*AA*A*AA *A*AA A*A*A *AA*A*AA*A A |
520000 |-+A*A*AA *AA*A *A*AA*A*AA *A*A A *A+-|
|*A |
500000 |-+ +-|
| |
480000 |-+ +-|
| |
460000 |-+ +-|
| |
| |
440000 |-+ +-|
| |
420000 |-+ +-|
| + + + + + + + |
400000 +-----------------------------------------------------------------+
0 5 10 15 20 25 30 35 40
root@kmod ~ # cat graph-0008-after.txt
580000 +-----------------------------------------------------------------+
| + + + + + + + |
560000 |-+ Memory usage (KiB) ***A***-|
| |
540000 |-+ +-|
| |
| |
520000 |-+ +-|
| |
500000 |-+ +-|
| |
480000 |-+ +-|
| |
460000 |-+ +-|
| |
| *A *A*A |
440000 |-+A*A*AA*A A A*A*AA A*A*AA*A*AA*A*AA*A*AA*AA*A*AA*A*AA-|
|*A *A*AA*A |
420000 |-+ +-|
| + + + + + + + |
400000 +-----------------------------------------------------------------+
0 5 10 15 20 25 30 35 40
The stress-ng module tests:
This is used to run the test to try to reproduce the vmap issues
reported by David:
echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs
Prior to this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > baseline-stress-ng.txt
root@kmod ~ # sort -n -r baseline-stress-ng.txt | head -1
5046456
After this commit:
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > after-stress-ng.txt
root@kmod ~ # sort -n -r after-stress-ng.txt | head -1
4896972
5046456 - 4896972
149484
149484/1024
145.98046875000000000000
So with stress-ng this commit reveals a saving of about 145 MiB of memory
using 100 ops, the workload which reproduced the reported vmap issue.
cat kmod.plot
set term dumb
set output fileout
set yrange [4700000:5070000]
plot filein with linespoints title "Memory usage (KiB)"
root@kmod ~ # gnuplot -e "filein='baseline-stress-ng.txt'; fileout='graph-stress-ng-before.txt'" kmod-simple-stress-ng.plot
root@kmod ~ # gnuplot -e "filein='after-stress-ng.txt'; fileout='graph-stress-ng-after.txt'" kmod-simple-stress-ng.plot
root@kmod ~ # cat graph-stress-ng-before.txt
+---------------------------------------------------------------+
5.05e+06 |-+ + A + + + + + + +-|
| * Memory usage (KiB) ***A*** |
| * A |
5e+06 |-+ ** ** +-|
| ** * * A |
4.95e+06 |-+ * * A * A* +-|
| * * A A * * * * A |
| * * * * * * *A * * * A * |
4.9e+06 |-+ * * * A*A * A*AA*A A *A **A **A*A *+-|
| A A*A A * A * * A A * A * ** |
| * ** ** * * * * * * * |
4.85e+06 |-+ A A A ** * * ** *-|
| * * * * ** * |
| * A * * * * |
4.8e+06 |-+ * * * A A-|
| * * * |
4.75e+06 |-+ * * * +-|
| * ** |
| * + + + + + + ** + |
4.7e+06 +---------------------------------------------------------------+
0 5 10 15 20 25 30 35 40
root@kmod ~ # cat graph-stress-ng-after.txt
+---------------------------------------------------------------+
5.05e+06 |-+ + + + + + + + +-|
| Memory usage (KiB) ***A*** |
| |
5e+06 |-+ +-|
| |
4.95e+06 |-+ +-|
| |
| |
4.9e+06 |-+ *AA +-|
| A*AA*A*A A A*AA*AA*A*AA*A A A A*A *AA*A*A A A*AA*AA |
| * * ** * * * ** * *** * |
4.85e+06 |-+* *** * * * * *** A * * +-|
| * A * * ** * * A * * |
| * * * * ** * * |
4.8e+06 |-+* * * A * * * +-|
| * * * A * * |
4.75e+06 |-* * * * * +-|
| * * * * * |
| * + * *+ + + + + * *+ |
4.7e+06 +---------------------------------------------------------------+
0 5 10 15 20 25 30 35 40
[0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Loading modules with finit_module() can end up using vmalloc(), vmap()
and vmalloc() again, for a total of up to 3 separate allocations in the
worst case for a single module. We always kernel_read*() the module,
that's a vmalloc(). Then vmap() is used for module decompression, and in
that case the read buffer is freed as we use the now decompressed module
buffer to stuff data into our module copy. The last allocation is
specific to each architecture, but it is generally a series of vmalloc()
calls or a variation of vmalloc to handle ELF sections with special
permissions.
Evaluation with the new stress-ng module support [1] with just 100 ops
proves that you can easily end up using GiBs of data even with all the
care we have in the kernel and userspace today in trying not to load
modules which are already loaded. 100 ops seems to resemble the sort of
pressure a system with about 400 CPUs can create on module loading.
Although the issues relating to duplicate module requests, due to each
CPU incurring a new module request, are silly and some of these are
being fixed, we currently lack proper tooling to help easily diagnose
what happened, when it happened and who is likely to blame -- userspace
or kernel module autoloading.
Provide an initial set of stats which use debugfs to let us easily scrape
post-boot information about failed loads. This sort of information can
be used on production workloads to try to optimize *avoiding* redundant
memory pressure using finit_module().
There are a few examples that can be provided:
A 255 vCPU system without the next patch in this series applied:
Startup finished in 19.143s (kernel) + 7.078s (userspace) = 26.221s
graphical.target reached after 6.988s in userspace
And 13.58 GiB of virtual memory space lost due to failed module loading:
root@big ~ # cat /sys/kernel/debug/modules/stats
Mods ever loaded 67
Mods failed on kread 0
Mods failed on decompress 0
Mods failed on becoming 0
Mods failed on load 1411
Total module size 11464704
Total mod text size 4194304
Failed kread bytes 0
Failed decompress bytes 0
Failed becoming bytes 0
Failed kmod bytes 14588526272
Virtual mem wasted bytes 14588526272
Average mod size 171115
Average mod text size 62602
Average fail load bytes 10339140
Duplicate failed modules:
module-name How-many-times Reason
kvm_intel 249 Load
kvm 249 Load
irqbypass 8 Load
crct10dif_pclmul 128 Load
ghash_clmulni_intel 27 Load
sha512_ssse3 50 Load
sha512_generic 200 Load
aesni_intel 249 Load
crypto_simd 41 Load
cryptd 131 Load
evdev 2 Load
serio_raw 1 Load
virtio_pci 3 Load
nvme 3 Load
nvme_core 3 Load
virtio_pci_legacy_dev 3 Load
virtio_pci_modern_dev 3 Load
t10_pi 3 Load
virtio 3 Load
crc32_pclmul 6 Load
crc64_rocksoft 3 Load
crc32c_intel 40 Load
virtio_ring 3 Load
crc64 3 Load
The following screenshot, of a simple 8 vCPU, 8 GiB KVM guest with the
next patch in this series applied, shows that 226.53 MiB are wasted on
virtual memory allocations due to duplicate module requests during boot.
It also shows an average module memory size of 167.10 KiB and an
average module .text + .init.text size of 61.13 KiB. The end shows all
modules which were detected as duplicate requests and whether or not
they failed early after just the first kernel_read*() call or late after
we've already allocated the private space for the module in
layout_and_allocate(). A system with module decompression would reveal
more wasted virtual memory space.
We should put effort now into identifying the source of these duplicate
module requests and trimming them down as much as possible. Larger
systems will obviously show many more wasted virtual memory allocations.
root@kmod ~ # cat /sys/kernel/debug/modules/stats
Mods ever loaded 67
Mods failed on kread 0
Mods failed on decompress 0
Mods failed on becoming 83
Mods failed on load 16
Total module size 11464704
Total mod text size 4194304
Failed kread bytes 0
Failed decompress bytes 0
Failed becoming bytes 228959096
Failed kmod bytes 8578080
Virtual mem wasted bytes 237537176
Average mod size 171115
Average mod text size 62602
Avg fail becoming bytes 2758544
Average fail load bytes 536130
Duplicate failed modules:
module-name How-many-times Reason
kvm_intel 7 Becoming
kvm 7 Becoming
irqbypass 6 Becoming & Load
crct10dif_pclmul 7 Becoming & Load
ghash_clmulni_intel 7 Becoming & Load
sha512_ssse3 6 Becoming & Load
sha512_generic 7 Becoming & Load
aesni_intel 7 Becoming
crypto_simd 7 Becoming & Load
cryptd 3 Becoming & Load
evdev 1 Becoming
serio_raw 1 Becoming
nvme 3 Becoming
nvme_core 3 Becoming
t10_pi 3 Becoming
virtio_pci 3 Becoming
crc32_pclmul 6 Becoming & Load
crc64_rocksoft 3 Becoming
crc32c_intel 3 Becoming
virtio_pci_modern_dev 2 Becoming
virtio_pci_legacy_dev 1 Becoming
crc64 2 Becoming
virtio 2 Becoming
virtio_ring 2 Becoming
[0] https://github.com/ColinIanKing/stress-ng.git
[1] echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
The patient module check inside add_unformed_module() is as large as we
need it to be. It is also a bit hard to read, so just move it to a helper
and do the inverse checks first to help shift the code and make it easier
to read. The new helper is module_patient_check_exists().
To make this work we need to move finished_loading() up; we do that
without making any functional changes to that routine.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Simplify the concurrency delimiter we use for kmod by switching to a
semaphore.
I had used the kmod strategy to try to implement a similar concurrency
delimiter for the kernel_read*() calls from the finit_module() path so as
to reduce vmalloc() memory pressure. That effort hasn't yet provided
conclusive results, but one thing that became clear is that we can use
the alternative semaphore-based solution which Linus hinted at, instead
of the atomic / wait strategy.
I've stress tested this with kmod test 0008:
time /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
And I get only a *slight* delay. That delay however is small, a few
seconds for a full test loop run that runs 150 times, for about ~30-40
seconds. The small delay is worth the simplification IMHO.
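For illustration only, the general shape of a counted semaphore used as a
concurrency delimiter; the names, the limit and the timeout are
placeholders rather than the actual kmod code:

#include <linux/semaphore.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

/* Placeholder limit: allow up to 50 concurrent requests. */
static DEFINE_SEMAPHORE(demo_concurrent_max, 50);

static int demo_request(void)
{
	int ret;

	/* Block until a slot frees up, or give up after 5 seconds. */
	if (down_timeout(&demo_concurrent_max, msecs_to_jiffies(5000)))
		return -EBUSY;

	ret = 0;	/* ... the actual work (e.g. calling modprobe) goes here ... */

	up(&demo_concurrent_max);
	return ret;
}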
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Fundamentally semaphores are a counted primitive, but
DEFINE_SEMAPHORE() does not expose this and explicitly creates a
binary semaphore.
Change DEFINE_SEMAPHORE() to take a number argument and use that in the
few places that open-coded it using __SEMAPHORE_INITIALIZER().
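For illustration, the difference for users of the macro (the variable
names here are just examples): a binary semaphore now spells out its
count, and counted semaphores no longer need __SEMAPHORE_INITIALIZER():

#include <linux/semaphore.h>

/* Before this change DEFINE_SEMAPHORE(name) always created a binary
 * semaphore, and counted ones had to be open-coded:
 *
 *   static struct semaphore pool_sem =
 *           __SEMAPHORE_INITIALIZER(pool_sem, 4);
 */

/* After this change the count is an explicit argument: */
static DEFINE_SEMAPHORE(ioctl_sem, 1);   /* binary semaphore */
static DEFINE_SEMAPHORE(pool_sem, 4);    /* up to four concurrent holders */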
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[mcgrof: add some tribal knowledge about why some folks prefer
binary semaphores over mutexes]
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
There is no need for the __tick_nohz_idle_stop_tick() function between
tick_nohz_idle_stop_tick() and its implementation. Remove that
unnecessary step.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-6-frederic@kernel.org
|
|
The per-cpu iowait task counter is incremented locally upon sleeping.
But since the task can be woken to (and by) another CPU, the counter may
then be decremented remotely. This is the source of a race involving
readers vs the writer of the idle/iowait sleeptime.
The following scenario shows an example where a /proc/stat reader
observes a pending sleep time as IO whereas that pending sleep time
later eventually gets accounted as non-IO.
CPU 0                           CPU 1                           CPU 2
-----                           -----                           -----
//io_schedule() TASK A
current->in_iowait = 1
rq(0)->nr_iowait++
//switch to idle
                                // READ /proc/stat
                                // See nr_iowait_cpu(0) == 1
                                return ts->iowait_sleeptime +
                                    ktime_sub(ktime_get(), ts->idle_entrytime)
                                                                //try_to_wake_up(TASK A)
                                                                rq(0)->nr_iowait--
//idle exit
// See nr_iowait_cpu(0) == 0
ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)
As a result subsequent reads on /proc/stat may expose backward progress.
This is unfortunately hardly fixable. Just add a comment about that
condition.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-5-frederic@kernel.org
|
|
Reading idle/IO sleep time (e.g. from /proc/stat) can race with idle exit
updates because the state machine handling the stats is not atomic and
requires a coherent read batch.
As a result reading the sleep time may report irrelevant or backward
values.
Fix this by protecting the simple state machine with a seqcount. This is
expected to be cheap enough not to add measurable performance impact on
the idle path.
Note this only partially fixes the reader vs writer condition. A race
remains that involves remote updates of the CPU iowait task counter. It
can hardly be fixed.
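For illustration, the general seqcount pattern being applied here, with
simplified placeholder names rather than the exact tick-sched fields:

#include <linux/seqlock.h>
#include <linux/ktime.h>

struct idle_stats {
	seqcount_t seq;           /* initialized elsewhere with seqcount_init() */
	ktime_t    idle_sleeptime;
	ktime_t    idle_entrytime;
};

/* Writer side: idle entry/exit updates the pair under the seqcount.
 * Runs with preemption disabled, as on the idle path. */
static void idle_stats_update(struct idle_stats *s, ktime_t now)
{
	write_seqcount_begin(&s->seq);
	s->idle_sleeptime = ktime_add(s->idle_sleeptime,
				      ktime_sub(now, s->idle_entrytime));
	s->idle_entrytime = now;
	write_seqcount_end(&s->seq);
}

/* Reader side (e.g. /proc/stat): retry until a coherent snapshot is read. */
static ktime_t idle_stats_read(struct idle_stats *s, ktime_t now)
{
	unsigned int start;
	ktime_t sleeptime;

	do {
		start = read_seqcount_begin(&s->seq);
		sleeptime = ktime_add(s->idle_sleeptime,
				      ktime_sub(now, s->idle_entrytime));
	} while (read_seqcount_retry(&s->seq, start));

	return sleeptime;
}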
Reported-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230222144649.624380-4-frederic@kernel.org
|