Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfiler, xfrm and bluetooth.
Oddly this includes a fix for a posix clock regression; in our
previous PR we included a change there as a pre-requisite for
networking one. That fix proved to be buggy and requires the follow-up
included here. Thomas suggested we should send it, given we sent the
buggy patch.
Current release - regressions:
- posix-clock: Fix unbalanced locking in pc_clock_settime()
- netfilter: fix typo causing some targets not to load on IPv6
Current release - new code bugs:
- xfrm: policy: remove last remnants of pernet inexact list
Previous releases - regressions:
- core: fix races in netdev_tx_sent_queue()/dev_watchdog()
- bluetooth: fix UAF on sco_sock_timeout
- eth: hv_netvsc: fix VF namespace also in synthetic NIC
NETDEV_REGISTER event
- eth: usbnet: fix name regression
- eth: be2net: fix potential memory leak in be_xmit()
- eth: plip: fix transmit path breakage
Previous releases - always broken:
- sched: deny mismatched skip_sw/skip_hw flags for actions created by
classifiers
- netfilter: bpf: must hold reference on net namespace
- eth: virtio_net: fix integer overflow in stats
- eth: bnxt_en: replace ptp_lock with irqsave variant
- eth: octeon_ep: add SKB allocation failures handling in
__octep_oq_process_rx()
Misc:
- MAINTAINERS: add Simon as an official reviewer"
* tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits)
net: dsa: mv88e6xxx: support 4000ps cycle counter period
net: dsa: mv88e6xxx: read cycle counter period from hardware
net: dsa: mv88e6xxx: group cycle counter coefficients
net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition
hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event
net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x
Bluetooth: ISO: Fix UAF on iso_sock_timeout
Bluetooth: SCO: Fix UAF on sco_sock_timeout
Bluetooth: hci_core: Disable works on hci_unregister_dev
posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()
r8169: avoid unsolicited interrupts
net: sched: use RCU read-side critical section in taprio_dump()
net: sched: fix use-after-free in taprio_change()
net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers
net: usb: usbnet: fix name regression
mlxsw: spectrum_router: fix xa_store() error checking
virtio_net: fix integer overflow in stats
net: fix races in netdev_tx_sent_queue()/dev_watchdog()
net: wwan: fix global oob in wwan_rtnl_policy
netfilter: xtables: fix typo causing some targets not to load on IPv6
...
|
|
Going through the RCU-boost and rtmutex code, I ran into this utterly
confusing comment. Fix it to avoid confusing future readers.
[ tglx: Wordsmithed the comment ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/all/20241008092606.GJ33184@noisy.programming.kicks-ass.net
|
|
This patch adds uptr support in the map_value of the task local storage.
struct map_value {
struct user_data __uptr *uptr;
};
struct {
__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
__type(key, int);
__type(value, struct value_type);
} datamap SEC(".maps");
A new bpf_obj_pin_uptrs() is added to pin the user page and
also stores the kernel address back to the uptr for the
bpf prog to use later. It currently does not support
the uptr pointing to a user struct across two pages.
It also excludes PageHighMem support to keep it simple.
As of now, the 32bit bpf jit is missing other more crucial bpf
features. For example, many important bpf features depend on
bpf kfunc now but so far only one arch (x86-32) supports it
which was added by me as an example when kfunc was first
introduced to bpf.
The uptr can only be stored to the task local storage by the
syscall update_elem. Meaning the uptr will not be considered
if it is provided by the bpf prog through
bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE).
This is enforced by only calling
bpf_local_storage_update(swap_uptrs==true) in
bpf_pid_task_storage_update_elem. Everywhere else will
have swap_uptrs==false.
This will pump down to bpf_selem_alloc(swap_uptrs==true). It is
the only case that bpf_selem_alloc() will take the uptr value when
updating the newly allocated selem. bpf_obj_swap_uptrs() is added
to swap the uptr between the SDATA(selem)->data and the user provided
map_value in "void *value". bpf_obj_swap_uptrs() makes the
SDATA(selem)->data takes the ownership of the uptr and the user space
provided map_value will have NULL in the uptr.
The bpf_obj_unpin_uptrs() is called after map->ops->map_update_elem()
returning error. If the map->ops->map_update_elem has reached
a state that the local storage has taken the uptr ownership,
the bpf_obj_unpin_uptrs() will be a no op because the uptr
is NULL. A "__"bpf_obj_unpin_uptrs is added to make this
error path unpin easier such that it does not have to check
the map->record is NULL or not.
BPF_F_LOCK is not supported when the map_value has uptr.
This can be revisited later if there is a use case. A similar
swap_uptrs idea can be considered.
The final bit is to do unpin_user_page in the bpf_obj_free_fields().
The earlier patch has ensured that the bpf_obj_free_fields() has
gone through the rcu gp when needed.
Cc: linux-mm@kvack.org
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20241023234759.860539-7-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A later patch will enable the uptr usage in the task_local_storage map.
This will require the unpin_user_page() to be done after the rcu
task trace gp for the cases that the uptr may still be used by
a bpf prog. The bpf_obj_free_fields() will be the one doing
unpin_user_page(), so this patch is to postpone calling
bpf_obj_free_fields() to the rcu callback.
The bpf_obj_free_fields() is only required to be done in
the rcu callback when bpf->bpf_ma==true and reuse_now==false.
bpf->bpf_ma==true case is because uptr will only be enabled
in task storage which has already been moved to bpf_mem_alloc.
The bpf->bpf_ma==false case can be supported in the future
also if there is a need.
reuse_now==false when the selem (aka storage) is deleted
by bpf prog (bpf_task_storage_delete) or by syscall delete_elem().
In both cases, bpf_obj_free_fields() needs to wait for
rcu gp.
A few words on reuse_now==true. reuse_now==true when the
storage's owner (i.e. the task_struct) is destructing or the map
itself is doing map_free(). In both cases, no bpf prog should
have a hold on the selem and its uptrs, so there is no need to
postpone bpf_obj_free_fields(). reuse_now==true should be the
common case for local storage usage where the storage exists
throughout the lifetime of its owner (task_struct).
The bpf_obj_free_fields() needs to use the map->record. Doing
bpf_obj_free_fields() in a rcu callback will require the
bpf_local_storage_map_free() to wait for rcu_barrier. An optimization
could be only waiting for rcu_barrier when the map has uptr in
its map_value. This will require either yet another rcu callback
function or adding a bool in the selem to flag if the SDATA(selem)->smap
is still valid. This patch chooses to keep it simple and wait for
rcu_barrier for maps that use bpf_mem_alloc.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-6-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In a later patch, bpf_selem_free() will call unpin_user_page()
through bpf_obj_free_fields(). unpin_user_page() may take spin_lock.
However, some bpf_selem_free() call paths have held a raw_spin_lock.
Like this:
raw_spin_lock_irqsave()
bpf_selem_unlink_storage_nolock()
bpf_selem_free()
unpin_user_page()
spin_lock()
To avoid spinlock nested in raw_spinlock, bpf_selem_free() should be
done after releasing the raw_spinlock. The "bool reuse_now" arg is
replaced with "struct hlist_head *free_selem_list" in
bpf_selem_unlink_storage_nolock(). The bpf_selem_unlink_storage_nolock()
will append the to-be-free selem at the free_selem_list. The caller of
bpf_selem_unlink_storage_nolock() will need to call the new
bpf_selem_free_list(free_selem_list, reuse_now) to free the selem
after releasing the raw_spinlock.
Note that the selem->snode cannot be reused for linking to
the free_selem_list because the selem->snode is protected by the
raw_spinlock that we want to avoid holding. A new
"struct hlist_node free_node;" is union-ized with
the rcu_head. Only the first one successfully
hlist_del_init_rcu(&selem->snode) will be able
to use the free_node. After succeeding hlist_del_init_rcu(&selem->snode),
the free_node and rcu_head usage is serialized such that they
can share the 16 bytes in a union.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-5-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
bpf_selem_alloc()
In a later patch, the task local storage will only accept uptr
from the syscall update_elem and will not accept uptr from
the bpf prog. The reason is the bpf prog does not have a way
to provide a valid user space address.
bpf_local_storage_update() and bpf_selem_alloc() are used by
both bpf prog bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE)
and bpf syscall update_elem. "bool swap_uptrs" arg is added
to bpf_local_storage_update() and bpf_selem_alloc() to tell if
it is called by the bpf prog or by the bpf syscall. When
swap_uptrs==true, it is called by the syscall.
The arg is named (swap_)uptrs because the later patch will swap
the uptrs between the newly allocated selem and the user space
provided map_value. It will make error handling easier in case
map->ops->map_update_elem() fails and the caller can decide
if it needs to unpin the uptr in the user space provided
map_value or the bpf_local_storage_update() has already
taken the uptr ownership and will take care of unpinning it also.
Only swap_uptrs==false is passed now. The logic to handle
the true case will be added in a later patch.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch adds BPF_UPTR support to the verifier. Not that only the
map_value will support the "__uptr" type tag.
This patch enforces only BPF_LDX is allowed to the value of an uptr.
After BPF_LDX, it will mark the dst_reg as PTR_TO_MEM | PTR_MAYBE_NULL
with size deduced from the field.kptr.btf_id. This will make the
dst_reg pointed memory to be readable and writable as scalar.
There is a redundant "val_reg = reg_state(env, value_regno);" statement
in the check_map_kptr_access(). This patch takes this chance to remove
it also.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch introduces the "__uptr" type tag to BTF. It is to define
a pointer pointing to the user space memory. This patch adds BTF
logic to pass the "__uptr" type tag.
btf_find_kptr() is reused for the "__uptr" tag. The "__uptr" will only
be supported in the map_value of the task storage map. However,
btf_parse_struct_meta() also uses btf_find_kptr() but it is not
interested in "__uptr". This patch adds a "field_mask" argument
to btf_find_kptr() which will return BTF_FIELD_IGNORE if the
caller is not interested in a “__uptr” field.
btf_parse_kptr() is also reused to parse the uptr.
The btf_check_and_fixup_fields() is changed to do extra
checks on the uptr to ensure that its struct size is not larger
than PAGE_SIZE. It is not clear how a uptr pointing to a CO-RE
supported kernel struct will be used, so it is also not allowed now.
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing
bpf_link_type_strs[link->type] may result in an out-of-bounds access.
To spot such missed invocations early in the future, checking the
validity of link->type in bpf_link_show_fdinfo() and emitting a warning
when such invocations are missed.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com
|
|
ops.dispatch() and ops.yield() may be fed a NULL task_struct pointer.
set_arg_maybe_null() is used to tell the verifier that they should be NULL
checked before being dereferenced. BPF now has an a lot prettier way to
express this - tagging arguments in CFI stubs with __nullable. Replace
set_arg_maybe_null() with __nullable CFI stub tags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
CFI stubs can be used to tag arguments with __nullable (and possibly other
tags in the future) but for that to work the CFI stubs must have names that
are recognized by BPF. Rename them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
rt_mutex_slowlock_block() and rtlock_slowlock_locked() both unlock
lock::wait_lock and then lock it later. This is unusual and sparse
complains about it.
Add __releases() + __acquires() annotation to mark that it is expected.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812104200.2239232-5-bigeasy@linutronix.de
|
|
Every lock, that becomes a sleeping lock on PREEMPT_RT, starts a RCU read
side critical section. There is no sparse annotation for this and sparse
complains about unbalanced locking.
Add __acquires/ __releases for the RCU lock. This covers all but the
trylock functions. A __cond_acquires() annotation didn't work.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240812104200.2239232-4-bigeasy@linutronix.de
|
|
callback
Since SLOB was removed and since commit 6c6c47b063b5 ("mm, slab: call
kvfree_rcu_barrier() from kmem_cache_destroy()"), it is not longer
necessary to use call_rcu() when the callback only performs
kmem_cache_free(). Use kfree_rcu() directly.
The changes were made using Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Link: https://lore.kernel.org/all/20241013201704.49576-12-Julia.Lawall@inria.fr
|
|
If devres_destroy() does not find a matching devres entry, then
devm_free_irq() emits a warning and tries to free the interrupt.
That's wrong as devm_free_irq() should only undo what devm_request_irq()
set up.
Replace devres_destroy() with a call to devres_release() which only invokes
the release function (free_irq()) in case that a matching devres entry was
found.
[ tglx: Massaged change log ]
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241018-devres_kernel_fix-v2-1-08918ae84982@quicinc.com
|
|
We need `goto next_insn;` at the end of patching instead of `continue;`.
It currently works by accident by making verifier re-process patched
instructions.
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Fixes: 314a53623cd4 ("bpf: inline bpf_get_branch_snapshot() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20241023161916.2896274-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Peter reported that perf_event_detach_bpf_prog might skip to release
the bpf program for -ENOENT error from bpf_prog_array_copy.
This can't happen because bpf program is stored in perf event and is
detached and released only when perf event is freed.
Let's drop the -ENOENT check and make sure the bpf program is released
in any case.
Fixes: 170a7e3ea070 ("bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not found")
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241023200352.3488610-1-jolsa@kernel.org
Closes: https://lore.kernel.org/lkml/20241022111638.GC16066@noisy.programming.kicks-ass.net/
|
|
Add help to update cgroup CGRP_FROZEN flag. Both cgroup_propagate_frozen
and cgroup_update_frozen functions update CGRP_FROZEN flag, this makes
code concise.
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Whether a cgroup is frozen is determined solely by whether it is set to
to be frozen and whether its parent is frozen. Currently, when is cgroup
is frozen or unfrozen, it iterates through the entire subtree to freeze
or unfreeze its descentdants. However, this is unesessary for a cgroup
that does not change its effective frozen status. This path aims to skip
the subtree if its parent does not have a change in effective freeze.
For an example, subtree like, a-b-c-d-e-f-g, when a is frozen, the
entire tree is frozen. If we freeze b and c again, it is unesessary to
iterate d, e, f and g. So does that If we unfreeze b/c.
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Rely on the scheduler topology information to implement basic LLC
awareness in the sched_ext build-in idle selection policy.
This allows schedulers using the built-in policy to make more informed
decisions when selecting an idle CPU in systems with multiple LLCs, such
as NUMA systems or chiplet-based architectures, and it helps keep tasks
within the same LLC domain, thereby improving cache locality.
For efficiency, LLC awareness is applied only to tasks that can run on
all the CPUs in the system for now. If a task's affinity is modified
from user space, it's the responsibility of user space to choose the
appropriate optimized scheduling domain.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Update ops.select_cpu() documentation to clarify that this method is not
called for tasks that are restricted to run on a single CPU, as these
tasks do not have the option to select a different CPU.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
This change allows the uprobe consumer to behave as session which
means that 'handler' and 'ret_handler' callbacks are connected in
a way that allows to:
- control execution of 'ret_handler' from 'handler' callback
- share data between 'handler' and 'ret_handler' callbacks
The session concept fits to our common use case where we do filtering
on entry uprobe and based on the result we decide to run the return
uprobe (or not).
It's also convenient to share the data between session callbacks.
To achive this we are adding new return value the uprobe consumer
can return from 'handler' callback:
UPROBE_HANDLER_IGNORE
- Ignore 'ret_handler' callback for this consumer.
And store cookie and pass it to 'ret_handler' when consumer has both
'handler' and 'ret_handler' callbacks defined.
We store shared data in the return_consumer object array as part of
the return_instance object. This way the handle_uretprobe_chain can
find related return_consumer and its shared data.
We also store entry handler return value, for cases when there are
multiple consumers on single uprobe and some of them are ignored and
some of them not, in which case the return probe gets installed and
we need to have a way to find out which consumer needs to be ignored.
The tricky part is when consumer is registered 'after' the uprobe
entry handler is hit. In such case this consumer's 'ret_handler' gets
executed as well, but it won't have the proper data pointer set,
so we can filter it out.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241018202252.693462-3-jolsa@kernel.org
|
|
Adding data pointer to both entry and exit consumer handlers and all
its users. The functionality itself is coming in following change.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241018202252.693462-2-jolsa@kernel.org
|
|
Syzkaller robot reported KCSAN tripping over the
ASSERT_EXCLUSIVE_WRITER(p->on_rq) in __block_task().
The report noted that both pick_next_task_fair() and try_to_wake_up()
were concurrently trying to write to the same p->on_rq, violating the
assertion -- even though both paths hold rq->__lock.
The logical consequence is that both code paths end up holding a
different rq->__lock. And looking through ttwu(), this is possible
when the __block_task() 'p->on_rq = 0' store is visible to the ttwu()
'p->on_rq' load, which then assumes the task is not queued and
continues to migrate it.
Rearrange things such that __block_task() releases @p with the store
and no code thereafter will use @p again.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: syzbot+0ec1e96c2cdf5c0e512a@syzkaller.appspotmail.com
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20241023093641.GE16066@noisy.programming.kicks-ass.net
|
|
Running rcutorture scenario TREE05, the below warning is triggered.
[ 32.604594] WARNING: suspicious RCU usage
[ 32.605928] 6.11.0-rc5-00040-g4ba4f1afb6a9 #55238 Not tainted
[ 32.607812] -----------------------------
[ 32.609140] kernel/events/core.c:13946 RCU-list traversed in non-reader section!!
[ 32.611595] other info that might help us debug this:
[ 32.614247] rcu_scheduler_active = 2, debug_locks = 1
[ 32.616392] 3 locks held by cpuhp/4/35:
[ 32.617687] #0: ffffffffb666a650 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[ 32.620563] #1: ffffffffb666cd20 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[ 32.623412] #2: ffffffffb677c288 (pmus_lock){+.+.}-{3:3}, at: perf_event_exit_cpu_context+0x32/0x2f0
In perf_event_clear_cpumask(), uses list_for_each_entry_rcu() without an
obvious RCU read-side critical section.
Either pmus_srcu or pmus_lock is good enough to protect the pmus list.
In the current context, pmus_lock is already held. The
list_for_each_entry_rcu() is not required.
Fixes: 4ba4f1afb6a9 ("perf: Generic hotplug support for a PMU with a scope")
Closes: https://lore.kernel.org/lkml/2b66dff8-b827-494b-b151-1ad8d56f13e6@paulmck-laptop/
Closes: https://lore.kernel.org/oe-lkp/202409131559.545634cc-oliver.sang@intel.com
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240913162340.2142976-1-kan.liang@linux.intel.com
|
|
The value of ARRAY_SIZE() can be determined at compile time, so if both
sides of the equation are ARRAY_SIZE(), using BUILD_BUG_ON() can help us
catch the problem earlier.
While there are cases where unequal array sizes will work, there is no
point in allowing them, so it makes more sense to force them to be equal
using BUILD_BUG_ON().
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
|
Currently, once an RCU CPU stall warning decides to dump the stalling
CPUs' stacks, the rcu_dump_cpu_stacks() function persists until it
has gone through the full list. Unfortunately, if the stalled grace
periods ends midway through, this function will be dumping stacks of
innocent-bystander CPUs that happen to be blocking not the old grace
period, but instead the new one. This can cause serious confusion.
This commit therefore stops dumping stacks if and when the stalled grace
period ends.
[ paulmck: Apply Joel Fernandes feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
|
The rcu_gp_might_be_stalled() function is no longer used, so this commit
removes it.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
|
If get_clock_desc() succeeds, it calls fget() for the clockid's fd,
and get the clk->rwsem read lock, so the error path should release
the lock to make the lock balance and fput the clockid's fd to make
the refcount balance and release the fd related resource.
However the below commit left the error path locked behind resulting in
unbalanced locking. Check timespec64_valid_strict() before
get_clock_desc() to fix it, because the "ts" is not changed
after that.
Fixes: d8794ac20a29 ("posix-clock: Fix missing timespec64 check in pc_clock_settime()")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
[pabeni@redhat.com: fixed commit message typo]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Reorganize kerneldoc parameter names to match the parameter
order in the function header.
Problems identified using Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
The data of user_table is never modified,
but only used as a template to create copies from.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
strlen() returns a string length excluding the null byte. If the string
length equals to the maximum buffer length, the buffer will have no
space for the NULL terminating character.
This commit checks this condition and returns failure for it.
Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/
Fixes: dec65d79fd26 ("tracing/probe: Check event name length correctly")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
When creating a trace_probe we would set nr_args prior to truncating the
arguments to MAX_TRACE_ARGS. However, we would only initialize arguments
up to the limit.
This caused invalid memory access when attempting to set up probes with
more than 128 fetchargs.
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:__set_print_fmt+0x134/0x330
Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return
an error when there are too many arguments instead of silently
truncating.
Link: https://lore.kernel.org/all/20240930202656.292869-1-mikel@mikelr.com/
Fixes: 035ba76014c0 ("tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init")
Signed-off-by: Mikel Rychliski <mikel@mikelr.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Remove hard-coded strings by using the helper function str_yes_no().
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Lonial reported an issue in the BPF verifier where check_mem_size_reg()
has the following code:
if (!tnum_is_const(reg->var_off))
/* For unprivileged variable accesses, disable raw
* mode so that the program is required to
* initialize all the memory that the helper could
* just partially fill up.
*/
meta = NULL;
This means that writes are not checked when the register containing the
size of the passed buffer has not a fixed size. Through this bug, a BPF
program can write to a map which is marked as read-only, for example,
.rodata global maps.
The problem is that MEM_UNINIT's initial meaning that "the passed buffer
to the BPF helper does not need to be initialized" which was added back
in commit 435faee1aae9 ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type")
got overloaded over time with "the passed buffer is being written to".
The problem however is that checks such as the above which were added later
via 06c1c049721a ("bpf: allow helpers access to variable memory") set meta
to NULL in order force the user to always initialize the passed buffer to
the helper. Due to the current double meaning of MEM_UNINIT, this bypasses
verifier write checks to the memory (not boundary checks though) and only
assumes the latter memory is read instead.
Fix this by reverting MEM_UNINIT back to its original meaning, and having
MEM_WRITE as an annotation to BPF helpers in order to then trigger the
BPF verifier checks for writing to memory.
Some notes: check_arg_pair_ok() ensures that for ARG_CONST_SIZE{,_OR_ZERO}
we can access fn->arg_type[arg - 1] since it must contain a preceding
ARG_PTR_TO_MEM. For check_mem_reg() the meta argument can be removed
altogether since we do check both BPF_READ and BPF_WRITE. Same for the
equivalent check_kfunc_mem_size_reg().
Fixes: 7b3552d3f9f6 ("bpf: Reject writes for PTR_TO_MAP_KEY in check_helper_mem_access")
Fixes: 97e6d7dab1ca ("bpf: Check PTR_TO_MEM | MEM_RDONLY in check_helper_mem_access")
Fixes: 15baa55ff5b0 ("bpf/verifier: allow all functions to read user provided context")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add a MEM_WRITE attribute for BPF helper functions which can be used in
bpf_func_proto to annotate an argument type in order to let the verifier
know that the helper writes into the memory passed as an argument. In
the past MEM_UNINIT has been (ab)used for this function, but the latter
merely tells the verifier that the passed memory can be uninitialized.
There have been bugs with overloading the latter but aside from that
there are also cases where the passed memory is read + written which
currently cannot be expressed, see also 4b3786a6c539 ("bpf: Zero former
ARG_PTR_TO_{LONG,INT} args in case of error").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In bpf_parse_param(), keep the value of param->string intact so it can
be freed later. Otherwise, the kmalloc area pointed to by param->string
will be leaked as shown below:
unreferenced object 0xffff888118c46d20 (size 8):
comm "new_name", pid 12109, jiffies 4295580214
hex dump (first 8 bytes):
61 6e 79 00 38 c9 5c 7e any.8.\~
backtrace (crc e1b7f876):
[<00000000c6848ac7>] kmemleak_alloc+0x4b/0x80
[<00000000de9f7d00>] __kmalloc_node_track_caller_noprof+0x36e/0x4a0
[<000000003e29b886>] memdup_user+0x32/0xa0
[<0000000007248326>] strndup_user+0x46/0x60
[<0000000035b3dd29>] __x64_sys_fsconfig+0x368/0x3d0
[<0000000018657927>] x64_sys_call+0xff/0x9f0
[<00000000c0cabc95>] do_syscall_64+0x3b/0xc0
[<000000002f331597>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes: 6c1752e0b6ca ("bpf: Support symbolic BPF FS delegation mount options")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20241022130133.3798232-1-houtao@huaweicloud.com
|
|
Commit 17351eb59abd ("rcu/nocb: Simplify (de-)offloading state machine")
removed the implementation but leave declaration.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
|
|
Implement bpf_send_signal_task kfunc that is similar to
bpf_send_signal_thread and bpf_send_signal helpers but can be used to
send signals to other threads and processes. It also supports sending a
cookie with the signal similar to sigqueue().
If the receiving process establishes a handler for the signal using the
SA_SIGINFO flag to sigaction(), then it can obtain this cookie via the
si_value field of the siginfo_t structure passed as the second argument
to the handler.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241016084136.10305-2-puranjay@kernel.org
|
|
Only cgroup v2 can be attached by bpf programs, so this patch introduces
that cgroup_bpf_inherit and cgroup_bpf_offline can only be called in
cgroup v2, and this can fix the memleak mentioned by commit 04f8ef5643bc
("cgroup: Fix memory leak caused by missing cgroup_bpf_offline"), which
has been reverted.
Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
This reverts commit 04f8ef5643bcd8bcde25dfdebef998aea480b2ba.
Only cgroup v2 can be attached by cgroup by BPF programs. Revert this
commit and cgroup_bpf_inherit and cgroup_bpf_offline won't be called in
cgroup v1. The memory leak issue will be fixed with next patch.
Fixes: 04f8ef5643bc ("cgroup: Fix memory leak caused by missing cgroup_bpf_offline")
Link: https://lore.kernel.org/cgroups/aka2hk5jsel5zomucpwlxsej6iwnfw4qu5jkrmjhyfhesjlfdw@46zxhg5bdnr7/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
sched_getattr(2) doesn't care about trailing non-zero bytes in the
(ksize > usize) case, so just use copy_struct_to_user() without checking
ignored_trailing.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20241010-extensible-structs-check_fields-v3-2-d2833dfe6edd@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Overlapping fixes solving the same bug slightly differently:
7266f0a6d3bb fs/bcachefs: Fix __wait_on_freeing_inode() definition of waitqueue entry
3b80552e7057 bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry
Use the upstream version.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Uprobe needs to fetch args into a percpu buffer, and then copy to ring
buffer to avoid non-atomic context problem.
Sometimes user-space strings, arrays can be very large, but the size of
percpu buffer is only page size. And store_trace_args() won't check
whether these data exceeds a single page or not, caused out-of-bounds
memory access.
It could be reproduced by following steps:
1. build kernel with CONFIG_KASAN enabled
2. save follow program as test.c
```
\#include <stdio.h>
\#include <stdlib.h>
\#include <string.h>
// If string length large than MAX_STRING_SIZE, the fetch_store_strlen()
// will return 0, cause __get_data_size() return shorter size, and
// store_trace_args() will not trigger out-of-bounds access.
// So make string length less than 4096.
\#define STRLEN 4093
void generate_string(char *str, int n)
{
int i;
for (i = 0; i < n; ++i)
{
char c = i % 26 + 'a';
str[i] = c;
}
str[n-1] = '\0';
}
void print_string(char *str)
{
printf("%s\n", str);
}
int main()
{
char tmp[STRLEN];
generate_string(tmp, STRLEN);
print_string(tmp);
return 0;
}
```
3. compile program
`gcc -o test test.c`
4. get the offset of `print_string()`
```
objdump -t test | grep -w print_string
0000000000401199 g F .text 000000000000001b print_string
```
5. configure uprobe with offset 0x1199
```
off=0x1199
cd /sys/kernel/debug/tracing/
echo "p /root/test:${off} arg1=+0(%di):ustring arg2=\$comm arg3=+0(%di):ustring"
> uprobe_events
echo 1 > events/uprobes/enable
echo 1 > tracing_on
```
6. run `test`, and kasan will report error.
==================================================================
BUG: KASAN: use-after-free in strncpy_from_user+0x1d6/0x1f0
Write of size 8 at addr ffff88812311c004 by task test/499CPU: 0 UID: 0 PID: 499 Comm: test Not tainted 6.12.0-rc3+ #18
Hardware name: Red Hat KVM, BIOS 1.16.0-4.al8 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x55/0x70
print_address_description.constprop.0+0x27/0x310
kasan_report+0x10f/0x120
? strncpy_from_user+0x1d6/0x1f0
strncpy_from_user+0x1d6/0x1f0
? rmqueue.constprop.0+0x70d/0x2ad0
process_fetch_insn+0xb26/0x1470
? __pfx_process_fetch_insn+0x10/0x10
? _raw_spin_lock+0x85/0xe0
? __pfx__raw_spin_lock+0x10/0x10
? __pte_offset_map+0x1f/0x2d0
? unwind_next_frame+0xc5f/0x1f80
? arch_stack_walk+0x68/0xf0
? is_bpf_text_address+0x23/0x30
? kernel_text_address.part.0+0xbb/0xd0
? __kernel_text_address+0x66/0xb0
? unwind_get_return_address+0x5e/0xa0
? __pfx_stack_trace_consume_entry+0x10/0x10
? arch_stack_walk+0xa2/0xf0
? _raw_spin_lock_irqsave+0x8b/0xf0
? __pfx__raw_spin_lock_irqsave+0x10/0x10
? depot_alloc_stack+0x4c/0x1f0
? _raw_spin_unlock_irqrestore+0xe/0x30
? stack_depot_save_flags+0x35d/0x4f0
? kasan_save_stack+0x34/0x50
? kasan_save_stack+0x24/0x50
? mutex_lock+0x91/0xe0
? __pfx_mutex_lock+0x10/0x10
prepare_uprobe_buffer.part.0+0x2cd/0x500
uprobe_dispatcher+0x2c3/0x6a0
? __pfx_uprobe_dispatcher+0x10/0x10
? __kasan_slab_alloc+0x4d/0x90
handler_chain+0xdd/0x3e0
handle_swbp+0x26e/0x3d0
? __pfx_handle_swbp+0x10/0x10
? uprobe_pre_sstep_notifier+0x151/0x1b0
irqentry_exit_to_user_mode+0xe2/0x1b0
asm_exc_int3+0x39/0x40
RIP: 0033:0x401199
Code: 01 c2 0f b6 45 fb 88 02 83 45 fc 01 8b 45 fc 3b 45 e4 7c b7 8b 45 e4 48 98 48 8d 50 ff 48 8b 45 e8 48 01 d0 ce
RSP: 002b:00007ffdf00576a8 EFLAGS: 00000206
RAX: 00007ffdf00576b0 RBX: 0000000000000000 RCX: 0000000000000ff2
RDX: 0000000000000ffc RSI: 0000000000000ffd RDI: 00007ffdf00576b0
RBP: 00007ffdf00586b0 R08: 00007feb2f9c0d20 R09: 00007feb2f9c0d20
R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000401040
R13: 00007ffdf0058780 R14: 0000000000000000 R15: 0000000000000000
</TASK>
This commit enforces the buffer's maxlen less than a page-size to avoid
store_trace_args() out-of-memory access.
Link: https://lore.kernel.org/all/20241015060148.1108331-1-mqaio@linux.alibaba.com/
Fixes: dcad1a204f72 ("tracing/uprobes: Fetch args before reserving a ring buffer")
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduling fixes from Borislav Petkov:
- Add PREEMPT_RT maintainers
- Fix another aspect of delayed dequeued tasks wrt determining their
state, i.e., whether they're runnable or blocked
- Handle delayed dequeued tasks and their migration wrt PSI properly
- Fix the situation where a delayed dequeue task gets enqueued into a
new class, which should not happen
- Fix a case where memory allocation would happen while the runqueue
lock is held, which is a no-no
- Do not over-schedule when tasks with shorter slices preempt the
currently running task
- Make sure delayed to deque entities are properly handled before
unthrottling
- Other smaller cleanups and improvements
* tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Add an entry for PREEMPT_RT.
sched/fair: Fix external p->on_rq users
sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
sched/core: Dequeue PSI signals for blocked tasks that are delayed
sched: Fix delayed_dequeue vs switched_from_fair()
sched/core: Disable page allocation in task_tick_mm_cid()
sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
sched: Fix sched_delayed vs cfs_bandwidth
|
|
Using commas to declare struct members makes adding new members to this
struct not as nice with patch management.
Test results linux-modules-kpd succeed [0].
Signed-off-by: Matthew Maurer <mmaurer@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
[mcgrof: add automated test results from kdevops using KPD ]
Link: https://github.com/linux-kdevops/linux-modules-kpd/actions/runs/11420095343 # [0]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Validate properties of the strtab that are depended on elsewhere, but
were previously unchecked:
* String table nonempty (offset 0 is valid)
* String table has a leading NUL (offset 0 corresponds to "")
* String table is NUL terminated (strfoo functions won't run out of the
table while reading).
* All symbols names are inbounds of the string table.
Signed-off-by: Matthew Maurer <mmaurer@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
This patch only moves the existing strtab population to a function.
Validation comes in a following patch, this is split out to make the new
validation checks more clearly separated.
Signed-off-by: Matthew Maurer <mmaurer@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Group all the index detection together to make the parent function
easier to read.
Signed-off-by: Matthew Maurer <mmaurer@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|
|
Pull out index validation for the symbol string section.
Note that this does not validate the *contents* of the string table,
only shape and presence of the section.
Signed-off-by: Matthew Maurer <mmaurer@google.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
|