summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-02-14rcu/exp: Remove rcu_par_gp_wqFrederic Weisbecker
TREE04 running on short iterations can produce writer stalls of the following kind: ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0 task:rcu_torture_wri state:D stack:14568 pid:83 ppid:2 flags:0x00004000 Call Trace: <TASK> __schedule+0x2de/0x850 ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0 schedule+0x4f/0x90 synchronize_rcu_expedited+0x430/0x670 ? __pfx_autoremove_wake_function+0x10/0x10 ? __pfx_synchronize_rcu_expedited+0x10/0x10 do_rtws_sync.constprop.0+0xde/0x230 rcu_torture_writer+0x4b4/0xcd0 ? __pfx_rcu_torture_writer+0x10/0x10 kthread+0xc7/0xf0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Waiting for an expedited grace period and polling for an expedited grace period both are operations that internally rely on the same workqueue performing necessary asynchronous work. However, a dependency chain is involved between those two operations, as depicted below: ====== CPU 0 ======= ====== CPU 1 ======= synchronize_rcu_expedited() exp_funnel_lock() mutex_lock(&rcu_state.exp_mutex); start_poll_synchronize_rcu_expedited queue_work(rcu_gp_wq, &rnp->exp_poll_wq); synchronize_rcu_expedited_queue_work() queue_work(rcu_gp_wq, &rew->rew_work); wait_event() // A, wait for &rew->rew_work completion mutex_unlock() // B //======> switch to kworker sync_rcu_do_polled_gp() { synchronize_rcu_expedited() exp_funnel_lock() mutex_lock(&rcu_state.exp_mutex); // C, wait B .... } // D Since workqueues are usually implemented on top of several kworkers handling the queue concurrently, the above situation wouldn't deadlock most of the time because A then doesn't depend on D. But in case of memory stress, a single kworker may end up handling alone all the works in a serialized way. In that case the above layout becomes a problem because A then waits for D, closing a circular dependency: A -> D -> C -> B -> A This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed synchronize_rcu_expedited() is otherwise implemented on top of a kthread worker while polling still relies on rcu_gp_wq workqueue, breaking the above circular dependency chain. Fix this with making expedited grace period to always rely on kthread worker. The workqueue based implementation is essentially a duplicate anyway now that the per-node initialization is performed by per-node kthread workers. Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to manage the scheduler policy of these kthread workers. Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Reported-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Joel Fernandes <joel@joelfernandes.org> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Handle parallel exp gp kworkers affinityFrederic Weisbecker
Affine the parallel expedited gp kworkers to their respective RCU node in order to make them close to the cache their are playing with. This reuses the boost kthreads machinery that probe into CPU hotplug operations such that the kthreads become/stay affine to their respective node as soon/long as they contain online CPUs. Otherwise and if the current CPU going down was the last online on the leaf node, the related kthread is affine to the housekeeping CPUs. In the long run, this affinity VS CPU hotplug operation game should probably be implemented at the generic kthread level. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch] Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Make parallel exp gp kworker per rcu nodeFrederic Weisbecker
When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node initialization is performed in parallel via workqueues (one work per node). However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is performed by a single kworker serializing each node initialization (one work for all nodes). The second part is certainly less scalable and efficient beyond a single leaf node. To improve this, expand this single kworker into per-node kworkers. This new layout is eventually intended to remove the workqueues based implementation since it will essentially now become duplicate code. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Move expedited kthread worker creation functions above ↵Frederic Weisbecker
rcutree_prepare_cpu() The expedited kthread worker performing the per node initialization is going to be split into per node kthreads. As such, the future per node kthread creation will need to be called from CPU hotplug callbacks instead of an initcall, right beside the per node boost kthread creation. To prepare for that, move the kthread worker creation above rcutree_prepare_cpu() as a first step to make the review smoother for the upcoming modifications. No intended functional change. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu: s/boost_kthread_mutex/kthread_mutexFrederic Weisbecker
This mutex is currently protecting per node boost kthreads creation and affinity setting across CPU hotplug operations. Since the expedited kworkers will soon be split per node as well, they will be subject to the same concurrency constraints against hotplug. Therefore their creation and affinity tuning operations will be grouped with those of boost kthreads and then rely on the same mutex. To prepare for that, generalize its name. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Handle RCU expedited grace period kworker allocation failureFrederic Weisbecker
Just like is done for the kworker performing nodes initialization, gracefully handle the possible allocation failure of the RCU expedited grace period main kworker. While at it perform a rename of the related checking functions to better reflect the expedited specifics. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure ↵Frederic Weisbecker
recovery Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited grace periods is queued to a kworker. However if the allocation of that kworker failed, the nodes initialization is performed synchronously by the caller instead. Now the check for kworker initialization failure relies on the kworker pointer to be NULL while its value might actually encapsulate an allocation failure error. Make sure to handle this case. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Fixes: 9621fbee44df ("rcu: Move expedited grace period (GP) work to RT kthread_worker") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/exp: Remove full barrier upon main thread wakeupFrederic Weisbecker
When an expedited grace period is ending, care must be taken so that all the quiescent states propagated up to the root are correctly ordered against the wake up of the main expedited grace period workqueue. This ordering is already carried through the root rnp locking augmented by an smp_mb__after_unlock_lock() barrier. Therefore the explicit smp_mb() placed before the wake up is not needed and can be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Check rdp_gp->nocb_timer in __call_rcu_nocb_wake()Zqiang
Currently, only rdp_gp->nocb_timer is used, for nocb_timer of no-rdp_gp structure, the timer_pending() is always return false, this commit therefore need to check rdp_gp->nocb_timer in __call_rcu_nocb_wake(). Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Fix WARN_ON_ONCE() in the rcu_nocb_bypass_lock()Zqiang
For the kernels built with CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y and CONFIG_RCU_LAZY=y, the following scenarios will trigger WARN_ON_ONCE() in the rcu_nocb_bypass_lock() and rcu_nocb_wait_contended() functions: CPU2 CPU11 kthread rcu_nocb_cb_kthread ksys_write rcu_do_batch vfs_write rcu_torture_timer_cb proc_sys_write __kmem_cache_free proc_sys_call_handler kmemleak_free drop_caches_sysctl_handler delete_object_full drop_slab __delete_object shrink_slab put_object lazy_rcu_shrink_scan call_rcu rcu_nocb_flush_bypass __call_rcu_commn rcu_nocb_bypass_lock raw_spin_trylock(&rdp->nocb_bypass_lock) fail atomic_inc(&rdp->nocb_lock_contended); rcu_nocb_wait_contended WARN_ON_ONCE(smp_processor_id() != rdp->cpu); WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended)) | |_ _ _ _ _ _ _ _ _ _same rdp and rdp->cpu != 11_ _ _ _ _ _ _ _ _ __| Reproduce this bug with "echo 3 > /proc/sys/vm/drop_caches". This commit therefore uses rcu_nocb_try_flush_bypass() instead of rcu_nocb_flush_bypass() in lazy_rcu_shrink_scan(). If the nocb_bypass queue is being flushed, then rcu_nocb_try_flush_bypass will return directly. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Re-arrange call_rcu() NOCB specific codeFrederic Weisbecker
Currently the call_rcu() function interleaves NOCB and !NOCB enqueue code in a complicated way such that: * The bypass enqueue code may or may not have enqueued and may or may not have locked the ->nocb_lock. Everything that follows is in a Schrödinger locking state for the unwary reviewer's eyes. * The was_alldone is always set but only used in NOCB related code. * The NOCB wake up is distantly related to the locking hopefully performed by the bypass enqueue code that did not enqueue on the bypass list. Unconfuse the whole and gather NOCB and !NOCB specific enqueue code to their own functions. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Make IRQs disablement symmetricFrederic Weisbecker
Currently IRQs are disabled on call_rcu() and then depending on the context: * If the CPU is in nocb mode: - If the callback is enqueued in the bypass list, IRQs are re-enabled implictly by rcu_nocb_try_bypass() - If the callback is enqueued in the normal list, IRQs are re-enabled implicitly by __call_rcu_nocb_wake() * If the CPU is NOT in nocb mode, IRQs are reenabled explicitly from call_rcu() This makes the code a bit hard to follow, especially as it interleaves with nocb locking. To make the IRQ flags coverage clearer and also in order to prepare for moving all the nocb enqueue code to its own function, always re-enable the IRQ flags explicitly from call_rcu(). Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Remove needless full barrier after callback advancingFrederic Weisbecker
A full barrier is issued from nocb_gp_wait() upon callbacks advancing to order grace period completion with callbacks execution. However these two events are already ordered by the smp_mb__after_unlock_lock() barrier within the call to raw_spin_lock_rcu_node() that is necessary for callbacks advancing to happen. The following litmus test shows the kind of guarantee that this barrier provides: C smp_mb__after_unlock_lock {} // rcu_gp_cleanup() P0(spinlock_t *rnp_lock, int *gpnum) { // Grace period cleanup increase gp sequence number spin_lock(rnp_lock); WRITE_ONCE(*gpnum, 1); spin_unlock(rnp_lock); } // nocb_gp_wait() P1(spinlock_t *rnp_lock, spinlock_t *nocb_lock, int *gpnum, int *cb_ready) { int r1; // Call rcu_advance_cbs() from nocb_gp_wait() spin_lock(nocb_lock); spin_lock(rnp_lock); smp_mb__after_unlock_lock(); r1 = READ_ONCE(*gpnum); WRITE_ONCE(*cb_ready, 1); spin_unlock(rnp_lock); spin_unlock(nocb_lock); } // nocb_cb_wait() P2(spinlock_t *nocb_lock, int *cb_ready, int *cb_executed) { int r2; // rcu_do_batch() -> rcu_segcblist_extract_done_cbs() spin_lock(nocb_lock); r2 = READ_ONCE(*cb_ready); spin_unlock(nocb_lock); // Actual callback execution WRITE_ONCE(*cb_executed, 1); } P3(int *cb_executed, int *gpnum) { int r3; WRITE_ONCE(*cb_executed, 2); smp_mb(); r3 = READ_ONCE(*gpnum); } exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *) Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is removed. This barrier orders the grace period completion against callbacks advancing and even later callbacks invocation, thanks to the opportunistic propagation via the ->nocb_lock to nocb_cb_wait(). Therefore the smp_mb() placed after callbacks advancing can be safely removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14rcu/nocb: Remove needless LOAD-ACQUIREFrederic Weisbecker
The LOAD-ACQUIRE access performed on rdp->nocb_cb_sleep advertizes ordering callback execution against grace period completion. However this is contradicted by the following: * This LOAD-ACQUIRE doesn't pair with anything. The only counterpart barrier that can be found is the smp_mb() placed after callbacks advancing in nocb_gp_wait(). However the barrier is placed _after_ ->nocb_cb_sleep write. * Callbacks can be concurrently advanced between the LOAD-ACQUIRE on ->nocb_cb_sleep and the call to rcu_segcblist_extract_done_cbs() in rcu_do_batch(), making any ordering based on ->nocb_cb_sleep broken. * Both rcu_segcblist_extract_done_cbs() and rcu_advance_cbs() are called under the nocb_lock, the latter hereby providing already the desired ACQUIRE semantics. Therefore it is safe to access ->nocb_cb_sleep with a simple compiler barrier. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2024-02-14Merge branch 'x86/bugs' into x86/core, to pick up pending changes before ↵Ingo Molnar
dependent patches Merge in pending alternatives patching infrastructure changes, before applying more patches. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-13bpf: emit source code file name and line number in verifier logAndrii Nakryiko
As BPF applications grow in size and complexity and are separated into multiple .bpf.c files that are statically linked together, it becomes harder and harder to match verifier's BPF assembly level output to original C code. While often annotated C source code is unique enough to be able to identify the file it belongs to, quite often this is actually problematic as parts of source code can be quite generic. Long story short, it is very useful to see source code file name and line number information along with the original C code. Verifier already knows this information, we just need to output it. This patch extends verifier log with file name and line number information, emitted next to original (presumably C) source code, annotating BPF assembly output, like so: ; <original C code> @ <filename>.bpf.c:<line> If file name has directory names in it, they are stripped away. This should be fine in practice as file names tend to be pretty unique with C code anyways, and keeping log size smaller is always good. In practice this might look something like below, where some code is coming from application files, while others are from libbpf's usdt.bpf.h header file: ; if (STROBEMETA_READ( @ strobemeta_probe.bpf.c:534 5592: (79) r1 = *(u64 *)(r10 -56) ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0 5593: (7b) *(u64 *)(r10 -56) = r1 ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0 5594: (79) r3 = *(u64 *)(r10 -8) ; R3_w=scalar() R10=fp0 fp-8=mmmmmmmm ... 170: (71) r1 = *(u8 *)(r8 +15) ; frame1: R1_w=scalar(...) R8_w=map_value(map=__bpf_usdt_spec,ks=4,vs=208) 171: (67) r1 <<= 56 ; frame1: R1_w=scalar(...) 172: (c7) r1 s>>= 56 ; frame1: R1_w=scalar(smin=smin32=-128,smax=smax32=127) ; val <<= arg_spec->arg_bitshift; @ usdt.bpf.h:183 173: (67) r1 <<= 32 ; frame1: R1_w=scalar(...) 174: (77) r1 >>= 32 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 175: (79) r2 = *(u64 *)(r10 -8) ; frame1: R2_w=scalar() R10=fp0 fp-8=mmmmmmmm 176: (6f) r2 <<= r1 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R2_w=scalar() 177: (7b) *(u64 *)(r10 -8) = r2 ; frame1: R2_w=scalar(id=61) R10=fp0 fp-8_w=scalar(id=61) ; if (arg_spec->arg_signed) @ usdt.bpf.h:184 178: (bf) r3 = r2 ; frame1: R2_w=scalar(id=61) R3_w=scalar(id=61) 179: (7f) r3 >>= r1 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R3_w=scalar() ; if (arg_spec->arg_signed) @ usdt.bpf.h:184 180: (71) r4 = *(u8 *)(r8 +14) 181: safe log_fixup tests needed a minor adjustment as verifier log output increased a bit and that test is quite sensitive to such changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212235944.2816107-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13bpf: don't infer PTR_TO_CTX for programs with unnamed context typeAndrii Nakryiko
For program types that don't have named context type name (e.g., BPF iterator programs or tracepoint programs), ctx_tname will be a non-NULL empty string. For such programs it shouldn't be possible to have PTR_TO_CTX argument for global subprogs based on type name alone. arg:ctx tag is the only way to have PTR_TO_CTX passed into global subprog for such program types. Fix this loophole, which currently would assume PTR_TO_CTX whenever user uses a pointer to anonymous struct as an argument to their global subprogs. This happens in practice with the following (quite common, in practice) approach: typedef struct { /* anonymous */ int x; } my_type_t; int my_subprog(my_type_t *arg) { ... } User's intent is to have PTR_TO_MEM argument for `arg`, but verifier will complain about expecting PTR_TO_CTX. This fix also closes unintended s390x-specific KPROBE handling of PTR_TO_CTX case. Selftest change is necessary to accommodate this. Fixes: 91cc1a99740e ("bpf: Annotate context types") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13bpf: handle bpf_user_pt_regs_t typedef explicitly for PTR_TO_CTX global argAndrii Nakryiko
Expected canonical argument type for global function arguments representing PTR_TO_CTX is `bpf_user_pt_regs_t *ctx`. This currently works on s390x by accident because kernel resolves such typedef to underlying struct (which is anonymous on s390x), and erroneously accepting it as expected context type. We are fixing this problem next, which would break s390x arch, so we need to handle `bpf_user_pt_regs_t` case explicitly for KPROBE programs. Fixes: 91cc1a99740e ("bpf: Annotate context types") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13bpf: simplify btf_get_prog_ctx_type() into btf_is_prog_ctx_type()Andrii Nakryiko
Return result of btf_get_prog_ctx_type() is never used and callers only check NULL vs non-NULL case to determine if given type matches expected PTR_TO_CTX type. So rename function to `btf_is_prog_ctx_type()` and return a simple true/false. We'll use this simpler interface to handle kprobe program type's special typedef case in the next patch. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-02-13bpf: remove check in __cgroup_bpf_run_filter_skbOliver Crumrine
Originally, this patch removed a redundant check in BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in the function it called, __cgroup_bpf_run_filter_skb. For v2, it was reccomended that I remove the check from __cgroup_bpf_run_filter_skb, and add the checks to the other macro that calls that function, BPF_CGROUP_RUN_PROG_INET_INGRESS. To sum it up, checking that the socket exists and that it is a full socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the function they call, __cgroup_bpf_run_filter_skb. v3->v4: Fixed weird merge conflict. v2->v3: Sent to bpf-next instead of generic patch v1->v2: Addressed feedback about where check should be removed. Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3 Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13bpf: Create argument information for nullable arguments.Kui-Feng Lee
Collect argument information from the type information of stub functions to mark arguments of BPF struct_ops programs with PTR_MAYBE_NULL if they are nullable. A nullable argument is annotated by suffixing "__nullable" at the argument name of stub function. For nullable arguments, this patch sets a struct bpf_ctx_arg_aux to label their reg_type with PTR_TO_BTF_ID | PTR_TRUSTED | PTR_MAYBE_NULL. This makes the verifier to check programs and ensure that they properly check the pointer. The programs should check if the pointer is null before accessing the pointed memory. The implementer of a struct_ops type should annotate the arguments that can be null. The implementer should define a stub function (empty) as a placeholder for each defined operator. The name of a stub function should be in the pattern "<st_op_type>__<operator name>". For example, for test_maybe_null of struct bpf_testmod_ops, it's stub function name should be "bpf_testmod_ops__test_maybe_null". You mark an argument nullable by suffixing the argument name with "__nullable" at the stub function. Since we already has stub functions for kCFI, we just reuse these stub functions with the naming convention mentioned earlier. These stub functions with the naming convention is only required if there are nullable arguments to annotate. For functions having not nullable arguments, stub functions are not necessary for the purpose of this patch. This patch will prepare a list of struct bpf_ctx_arg_aux, aka arg_info, for each member field of a struct_ops type. "arg_info" will be assigned to "prog->aux->ctx_arg_info" of BPF struct_ops programs in check_struct_ops_btf_id() so that it can be used by btf_ctx_access() later to set reg_type properly for the verifier. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-4-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13bpf: Move __kfunc_param_match_suffix() to btf.c.Kui-Feng Lee
Move __kfunc_param_match_suffix() to btf.c and rename it as btf_param_match_suffix(). It can be reused by bpf_struct_ops later. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-3-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13bpf: add btf pointer to struct bpf_ctx_arg_aux.Kui-Feng Lee
Enable the providers to use types defined in a module instead of in the kernel (btf_vmlinux). Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-2-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-13bpf: Fix warning for bpf_cpumask in verifierHari Bathini
Compiling with CONFIG_BPF_SYSCALL & !CONFIG_BPF_JIT throws the below warning: "WARN: resolve_btfids: unresolved symbol bpf_cpumask" Fix it by adding the appropriate #ifdef. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20240208100115.602172-1-hbathini@linux.ibm.com
2024-02-13bpf: Mark bpf_spin_{lock,unlock}() helpers with notrace correctlyYonghong Song
Currently tracing is supposed not to allow for bpf_spin_{lock,unlock}() helper calls. This is to prevent deadlock for the following cases: - there is a prog (prog-A) calling bpf_spin_{lock,unlock}(). - there is a tracing program (prog-B), e.g., fentry, attached to bpf_spin_lock() and/or bpf_spin_unlock(). - prog-B calls bpf_spin_{lock,unlock}(). For such a case, when prog-A calls bpf_spin_{lock,unlock}(), a deadlock will happen. The related source codes are below in kernel/bpf/helpers.c: notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock) notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock) notrace is supposed to prevent fentry prog from attaching to bpf_spin_{lock,unlock}(). But actually this is not the case and fentry prog can successfully attached to bpf_spin_lock(). Siddharth Chintamaneni reported the issue in [1]. The following is the macro definition for above BPF_CALL_1: #define BPF_CALL_x(x, name, ...) \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \ { \ return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\ } \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)) #define BPF_CALL_1(name, ...) BPF_CALL_x(1, name, __VA_ARGS__) The notrace attribute is actually applied to the static always_inline function ____bpf_spin_{lock,unlock}(). The actual callback function bpf_spin_{lock,unlock}() is not marked with notrace, hence allowing fentry prog to attach to two helpers, and this may cause the above mentioned deadlock. Siddharth Chintamaneni actually has a reproducer in [2]. To fix the issue, a new macro NOTRACE_BPF_CALL_1 is introduced which will add notrace attribute to the original function instead of the hidden always_inline function and this fixed the problem. [1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@mail.gmail.com/ [2] https://lore.kernel.org/bpf/CAE5sdEg6yUc_Jz50AnUXEEUh6O73yQ1Z6NV2srJnef0ZrQkZew@mail.gmail.com/ Fixes: d83525ca62cf ("bpf: introduce bpf_spin_lock") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240207070102.335167-1-yonghong.song@linux.dev
2024-02-13bpf: Have bpf_rdonly_cast() take a const pointerDaniel Xu
Since 20d59ee55172 ("libbpf: add bpf_core_cast() macro"), libbpf is now exporting a const arg version of bpf_rdonly_cast(). This causes the following conflicting type error when generating kfunc prototypes from BTF: In file included from skeleton/pid_iter.bpf.c:5: /home/dxu/dev/linux/tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_core_read.h:297:14: error: conflicting types for 'bpf_rdonly_cast' extern void *bpf_rdonly_cast(const void *obj__ign, __u32 btf_id__k) __ksym __weak; ^ ./vmlinux.h:135625:14: note: previous declaration is here extern void *bpf_rdonly_cast(void *obj__ign, u32 btf_id__k) __weak __ksym; This is b/c the kernel defines bpf_rdonly_cast() with non-const arg. Since const arg is more permissive and thus backwards compatible, we change the kernel definition as well to avoid conflicting type errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/dfd3823f11ffd2d4c838e961d61ec9ae8a646773.1707080349.git.dxu@dxuuu.xyz
2024-02-13tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on()Sven Schnelle
tracer_tracing_is_on() checks whether record_disabled is not zero. This checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading the source it looks like this function should only check for the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes spurious fails in the 'test for function traceon/off triggers' test from the ftrace testsuite when the system is under load. Link: https://lore.kernel.org/linux-trace-kernel/20240205065340.2848065-1-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Tested-By: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-13tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdefPetr Pavlu
Commit a8b9cf62ade1 ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default") attempted to fix an issue with direct trampolines on x86, see its description for details. However, it wrongly referenced the HAVE_DYNAMIC_FTRACE_WITH_REGS config option and the problem is still present. Add the missing "CONFIG_" prefix for the logic to work as intended. Link: https://lore.kernel.org/linux-trace-kernel/20240213132434.22537-1-petr.pavlu@suse.com Fixes: a8b9cf62ade1 ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-11bpf: Allow compiler to inline most of bpf_local_storage_lookup()Marco Elver
In various performance profiles of kernels with BPF programs attached, bpf_local_storage_lookup() appears as a significant portion of CPU cycles spent. To enable the compiler generate more optimal code, turn bpf_local_storage_lookup() into a static inline function, where only the cache insertion code path is outlined Notably, outlining cache insertion helps avoid bloating callers by duplicating setting up calls to raw_spin_{lock,unlock}_irqsave() (on architectures which do not inline spin_lock/unlock, such as x86), which would cause the compiler produce worse code by deciding to outline otherwise inlinable functions. The call overhead is neutral, because we make 2 calls either way: either calling raw_spin_lock_irqsave() and raw_spin_unlock_irqsave(); or call __bpf_local_storage_insert_cache(), which calls raw_spin_lock_irqsave(), followed by a tail-call to raw_spin_unlock_irqsave() where the compiler can perform TCO and (in optimized uninstrumented builds) turns it into a plain jump. The call to __bpf_local_storage_insert_cache() can be elided entirely if cacheit_lockit is a false constant expression. Based on results from './benchs/run_bench_local_storage.sh' (21 trials, reboot between each trial; x86 defconfig + BPF, clang 16) this produces improvements in throughput and latency in the majority of cases, with an average (geomean) improvement of 8%: +---- Hashmap Control -------------------- | | + num keys: 10 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 14.789 M ops/s | 14.745 M ops/s ( ~ ) | +- hits latency | 67.679 ns/op | 67.879 ns/op ( ~ ) | +- important_hits throughput | 14.789 M ops/s | 14.745 M ops/s ( ~ ) | | + num keys: 1000 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 12.233 M ops/s | 12.170 M ops/s ( ~ ) | +- hits latency | 81.754 ns/op | 82.185 ns/op ( ~ ) | +- important_hits throughput | 12.233 M ops/s | 12.170 M ops/s ( ~ ) | | + num keys: 10000 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 7.220 M ops/s | 7.204 M ops/s ( ~ ) | +- hits latency | 138.522 ns/op | 138.842 ns/op ( ~ ) | +- important_hits throughput | 7.220 M ops/s | 7.204 M ops/s ( ~ ) | | + num keys: 100000 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 5.061 M ops/s | 5.165 M ops/s (+2.1%) | +- hits latency | 198.483 ns/op | 194.270 ns/op (-2.1%) | +- important_hits throughput | 5.061 M ops/s | 5.165 M ops/s (+2.1%) | | + num keys: 4194304 | : <before> | <after> | +-+ hashmap (control) sequential get +----------------------+---------------------- | +- hits throughput | 2.864 M ops/s | 2.882 M ops/s ( ~ ) | +- hits latency | 365.220 ns/op | 361.418 ns/op (-1.0%) | +- important_hits throughput | 2.864 M ops/s | 2.882 M ops/s ( ~ ) | +---- Local Storage ---------------------- | | + num_maps: 1 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 33.005 M ops/s | 39.068 M ops/s (+18.4%) | +- hits latency | 30.300 ns/op | 25.598 ns/op (-15.5%) | +- important_hits throughput | 33.005 M ops/s | 39.068 M ops/s (+18.4%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 37.151 M ops/s | 44.926 M ops/s (+20.9%) | +- hits latency | 26.919 ns/op | 22.259 ns/op (-17.3%) | +- important_hits throughput | 37.151 M ops/s | 44.926 M ops/s (+20.9%) | | + num_maps: 10 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 32.288 M ops/s | 38.099 M ops/s (+18.0%) | +- hits latency | 30.972 ns/op | 26.248 ns/op (-15.3%) | +- important_hits throughput | 3.229 M ops/s | 3.810 M ops/s (+18.0%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 34.473 M ops/s | 41.145 M ops/s (+19.4%) | +- hits latency | 29.010 ns/op | 24.307 ns/op (-16.2%) | +- important_hits throughput | 12.312 M ops/s | 14.695 M ops/s (+19.4%) | | + num_maps: 16 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 32.524 M ops/s | 38.341 M ops/s (+17.9%) | +- hits latency | 30.748 ns/op | 26.083 ns/op (-15.2%) | +- important_hits throughput | 2.033 M ops/s | 2.396 M ops/s (+17.9%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 34.575 M ops/s | 41.338 M ops/s (+19.6%) | +- hits latency | 28.925 ns/op | 24.193 ns/op (-16.4%) | +- important_hits throughput | 11.001 M ops/s | 13.153 M ops/s (+19.6%) | | + num_maps: 17 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 28.861 M ops/s | 32.756 M ops/s (+13.5%) | +- hits latency | 34.649 ns/op | 30.530 ns/op (-11.9%) | +- important_hits throughput | 1.700 M ops/s | 1.929 M ops/s (+13.5%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 31.529 M ops/s | 36.110 M ops/s (+14.5%) | +- hits latency | 31.719 ns/op | 27.697 ns/op (-12.7%) | +- important_hits throughput | 9.598 M ops/s | 10.993 M ops/s (+14.5%) | | + num_maps: 24 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 18.602 M ops/s | 19.937 M ops/s (+7.2%) | +- hits latency | 53.767 ns/op | 50.166 ns/op (-6.7%) | +- important_hits throughput | 0.776 M ops/s | 0.831 M ops/s (+7.2%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 21.718 M ops/s | 23.332 M ops/s (+7.4%) | +- hits latency | 46.047 ns/op | 42.865 ns/op (-6.9%) | +- important_hits throughput | 6.110 M ops/s | 6.564 M ops/s (+7.4%) | | + num_maps: 32 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 14.118 M ops/s | 14.626 M ops/s (+3.6%) | +- hits latency | 70.856 ns/op | 68.381 ns/op (-3.5%) | +- important_hits throughput | 0.442 M ops/s | 0.458 M ops/s (+3.6%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 17.111 M ops/s | 17.906 M ops/s (+4.6%) | +- hits latency | 58.451 ns/op | 55.865 ns/op (-4.4%) | +- important_hits throughput | 4.776 M ops/s | 4.998 M ops/s (+4.6%) | | + num_maps: 100 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 5.281 M ops/s | 5.528 M ops/s (+4.7%) | +- hits latency | 192.398 ns/op | 183.059 ns/op (-4.9%) | +- important_hits throughput | 0.053 M ops/s | 0.055 M ops/s (+4.9%) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 6.265 M ops/s | 6.498 M ops/s (+3.7%) | +- hits latency | 161.436 ns/op | 152.877 ns/op (-5.3%) | +- important_hits throughput | 1.636 M ops/s | 1.697 M ops/s (+3.7%) | | + num_maps: 1000 | : <before> | <after> | +-+ local_storage cache sequential get +----------------------+---------------------- | +- hits throughput | 0.355 M ops/s | 0.354 M ops/s ( ~ ) | +- hits latency | 2826.538 ns/op | 2827.139 ns/op ( ~ ) | +- important_hits throughput | 0.000 M ops/s | 0.000 M ops/s ( ~ ) | : | : <before> | <after> | +-+ local_storage cache interleaved get +----------------------+---------------------- | +- hits throughput | 0.404 M ops/s | 0.403 M ops/s ( ~ ) | +- hits latency | 2481.190 ns/op | 2487.555 ns/op ( ~ ) | +- important_hits throughput | 0.102 M ops/s | 0.101 M ops/s ( ~ ) The on_lookup test in {cgrp,task}_ls_recursion.c is removed because the bpf_local_storage_lookup is no longer traceable and adding tracepoint will make the compiler generate worse code: https://lore.kernel.org/bpf/ZcJmok64Xqv6l4ZS@elver.google.com/ Signed-off-by: Marco Elver <elver@google.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240207122626.3508658-1-elver@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-11Merge tag 'timers_urgent_for_v6.8_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Make sure a warning is issued when a hrtimer gets queued after the timers have been migrated on the CPU down path and thus said timer will get ignored * tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Report offline hrtimer enqueue
2024-02-10Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7 issues or aren't considered to be needed in earlier kernel versions" * tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) nilfs2: fix potential bug in end_buffer_async_write mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup nilfs2: fix hang in nilfs_lookup_dirty_data_buffers() MAINTAINERS: Leo Yan has moved mm/zswap: don't return LRU_SKIP if we have dropped lru lock fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super mailmap: switch email address for John Moon mm: zswap: fix objcg use-after-free in entry destruction mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range() arch/arm/mm: fix major fault accounting when retrying under per-VMA lock selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page mm: memcg: optimize parent iteration in memcg_rstat_updated() nilfs2: fix data corruption in dsync block recovery for small block sizes mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get() exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock) fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand() getrusage: use sig->stats_lock rather than lock_task_sighand() getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand() ...
2024-02-10pidfd: change pidfd_send_signal() to respect PIDFD_THREADOleg Nesterov
Turn kill_pid_info() into kill_pid_info_type(), this allows to pass any pid_type to group_send_sig_info(), despite its name it should work fine even if type = PIDTYPE_PID. Change pidfd_send_signal() to use PIDTYPE_PID or PIDTYPE_TGID depending on PIDFD_THREAD. While at it kill another TODO comment in pidfd_show_fdinfo(). As Christian expains fdinfo reports f_flags, userspace can already detect PIDFD_THREAD. Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240209130650.GA8048@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-10signal: fill in si_code in prepare_kill_siginfo()Oleg Nesterov
So that do_tkill() can use this helper too. This also simplifies the next patch. TODO: perhaps we can kill prepare_kill_siginfo() and change the callers to use SEND_SIG_NOINFO, but this needs some changes in __send_signal_locked() and TP_STORE_SIGINFO(). Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240209130620.GA8039@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-09async: Use a dedicated unbound workqueue with raised min_activeTejun Heo
Async can schedule a number of interdependent work items. However, since 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"), unbound workqueues have separate min_active which sets the number of interdependent work items that can be handled. This default value is 8 which isn't sufficient for async and can lead to stalls during resume from suspend in some cases. Let's use a dedicated unbound workqueue with raised min_active. Link: http://lkml.kernel.org/r/708a65cc-79ec-44a6-8454-a93d0f3114c3@samsung.com Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09workqueue: Implement workqueue_set_min_active()Tejun Heo
Since 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"), unbound workqueues have separate min_active which sets the number of interdependent work items that can be handled. This value is currently initialized to WQ_DFL_MIN_ACTIVE which is 8. This isn't high enough for some users, let's add an interface to adjust the setting. Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09workqueue: Fix kernel-doc comment of unplug_oldest_pwq()Waiman Long
Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable proper processing and formatting of the embedded ASCII diagram. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-09Merge tag 'trace-v6.8-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix broken direct trampolines being called when another callback is attached the same function. ARM 64 does not support FTRACE_WITH_REGS, and when it added direct trampoline calls from ftrace, it removed the "WITH_REGS" flag from the ftrace_ops for direct trampolines. This broke x86 as x86 requires direct trampolines to have WITH_REGS. This wasn't noticed because direct trampolines work as long as the function it is attached to is not shared with other callbacks (like the function tracer). When there are other callbacks, a helper trampoline is called, to call all the non direct callbacks and when it returns, the direct trampoline is called. For x86, the direct trampoline sets a flag in the regs field to tell the x86 specific code to call the direct trampoline. But this only works if the ftrace_ops had WITH_REGS set. ARM does things differently that does not require this. For now, set WITH_REGS if the arch supports WITH_REGS (which ARM does not), and this makes it work for both ARM64 and x86. - Fix wasted memory in the saved_cmdlines logic. The saved_cmdlines is a cache that maps PIDs to COMMs that tracing can use. Most trace events only save the PID in the event. The saved_cmdlines file lists PIDs to COMMs so that the tracing tools can show an actual name and not just a PID for each event. There's an array of PIDs that map to a small set of saved COMM strings. The array is set to PID_MAX_DEFAULT which is usually set to 32768. When a PID comes in, it will add itself to this array along with the index into the COMM array (note if the system allows more than PID_MAX_DEFAULT, this cache is similar to cache lines as an update of a PID that has the same PID_MAX_DEFAULT bits set will flush out another task with the same matching bits set). A while ago, the size of this cache was changed to be dynamic and the array was moved into a structure and created with kmalloc(). But this new structure had the size of 131104 bytes, or 0x20020 in hex. As kmalloc allocates in powers of two, it was actually allocating 0x40000 bytes (262144) leaving 131040 bytes of wasted memory. The last element of this structure was a pointer to the COMM string array which defaulted to just saving 128 COMMs. By changing the last field of this structure to a variable length string, and just having it round up to fill the allocated memory, the default size of the saved COMM cache is now 8190. This not only uses the wasted space, but actually saves space by removing the extra allocation for the COMM names. * tag 'trace-v6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix wasted memory in saved_cmdlines logic ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default
2024-02-09tracing: Fix wasted memory in saved_cmdlines logicSteven Rostedt (Google)
While looking at improving the saved_cmdlines cache I found a huge amount of wasted memory that should be used for the cmdlines. The tracing data saves pids during the trace. At sched switch, if a trace occurred, it will save the comm of the task that did the trace. This is saved in a "cache" that maps pids to comms and exposed to user space via the /sys/kernel/tracing/saved_cmdlines file. Currently it only caches by default 128 comms. The structure that uses this creates an array to store the pids using PID_MAX_DEFAULT (which is usually set to 32768). This causes the structure to be of the size of 131104 bytes on 64 bit machines. In hex: 131104 = 0x20020, and since the kernel allocates generic memory in powers of two, the kernel would allocate 0x40000 or 262144 bytes to store this structure. That leaves 131040 bytes of wasted space. Worse, the structure points to an allocated array to store the comm names, which is 16 bytes times the amount of names to save (currently 128), which is 2048 bytes. Instead of allocating a separate array, make the structure end with a variable length string and use the extra space for that. This is similar to a recommendation that Linus had made about eventfs_inode names: https://lore.kernel.org/all/20240130190355.11486-5-torvalds@linux-foundation.org/ Instead of allocating a separate string array to hold the saved comms, have the structure end with: char saved_cmdlines[]; and round up to the next power of two over sizeof(struct saved_cmdline_buffers) + num_cmdlines * TASK_COMM_LEN It will use this extra space for the saved_cmdline portion. Now, instead of saving only 128 comms by default, by using this wasted space at the end of the structure it can save over 8000 comms and even saves space by removing the need for allocating the other array. Link: https://lore.kernel.org/linux-trace-kernel/20240209063622.1f7b6d5f@rorschach.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Mete Durlu <meted@linux.ibm.com> Fixes: 939c7a4f04fcd ("tracing: Introduce saved_cmdlines_size file") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-09ftrace: Fix DIRECT_CALLS to use SAVE_REGS by defaultMasami Hiramatsu (Google)
The commit 60c8971899f3 ("ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGS") changed DIRECT_CALLS to use SAVE_ARGS when there are multiple ftrace_ops at the same function, but since the x86 only support to jump to direct_call from ftrace_regs_caller, when we set the function tracer on the same target function on x86, ftrace-direct does not work as below (this actually works on arm64.) At first, insmod ftrace-direct.ko to put a direct_call on 'wake_up_process()'. # insmod kernel/samples/ftrace/ftrace-direct.ko # less trace ... <idle>-0 [006] ..s1. 564.686958: my_direct_func: waking up rcu_preempt-17 <idle>-0 [007] ..s1. 564.687836: my_direct_func: waking up kcompactd0-63 <idle>-0 [006] ..s1. 564.690926: my_direct_func: waking up rcu_preempt-17 <idle>-0 [006] ..s1. 564.696872: my_direct_func: waking up rcu_preempt-17 <idle>-0 [007] ..s1. 565.191982: my_direct_func: waking up kcompactd0-63 Setup a function filter to the 'wake_up_process' too, and enable it. # cd /sys/kernel/tracing/ # echo wake_up_process > set_ftrace_filter # echo function > current_tracer # less trace ... <idle>-0 [006] ..s3. 686.180972: wake_up_process <-call_timer_fn <idle>-0 [006] ..s3. 686.186919: wake_up_process <-call_timer_fn <idle>-0 [002] ..s3. 686.264049: wake_up_process <-call_timer_fn <idle>-0 [002] d.h6. 686.515216: wake_up_process <-kick_pool <idle>-0 [002] d.h6. 686.691386: wake_up_process <-kick_pool Then, only function tracer is shown on x86. But if you enable 'kprobe on ftrace' event (which uses SAVE_REGS flag) on the same function, it is shown again. # echo 'p wake_up_process' >> dynamic_events # echo 1 > events/kprobes/p_wake_up_process_0/enable # echo > trace # less trace ... <idle>-0 [006] ..s2. 2710.345919: p_wake_up_process_0: (wake_up_process+0x4/0x20) <idle>-0 [006] ..s3. 2710.345923: wake_up_process <-call_timer_fn <idle>-0 [006] ..s1. 2710.345928: my_direct_func: waking up rcu_preempt-17 <idle>-0 [006] ..s2. 2710.349931: p_wake_up_process_0: (wake_up_process+0x4/0x20) <idle>-0 [006] ..s3. 2710.349934: wake_up_process <-call_timer_fn <idle>-0 [006] ..s1. 2710.349937: my_direct_func: waking up rcu_preempt-17 To fix this issue, use SAVE_REGS flag for multiple ftrace_ops flag of direct_call by default. Link: https://lore.kernel.org/linux-trace-kernel/170484558617.178953.1590516949390270842.stgit@devnote2 Fixes: 60c8971899f3 ("ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGS") Cc: stable@vger.kernel.org Cc: Florent Revest <revest@chromium.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64] Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-02-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/stmicro/stmmac/common.h 38cc3c6dcc09 ("net: stmmac: protect updates of 64-bit statistics counters") fd5a6a71313e ("net: stmmac: est: Per Tx-queue error count for HLBF") c5c3e1bfc9e0 ("net: stmmac: Offload queueMaxSDU from tc-taprio") drivers/net/wireless/microchip/wilc1000/netdev.c c9013880284d ("wifi: fill in MODULE_DESCRIPTION()s for wilc1000") 328efda22af8 ("wifi: wilc1000: do not realloc workqueue everytime an interface is added") net/unix/garbage.c 11498715f266 ("af_unix: Remove io_uring code for GC.") 1279f9d9dec2 ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-08bpf, btf: Check btf for register_bpf_struct_opsGeliang Tang
Similar to the handling in the functions __register_btf_kfunc_id_set() and register_btf_id_dtor_kfuncs(), this patch uses the newly added helper check_btf_kconfigs() to handle module with its btf section stripped. While at it, the patch also adds the missed IS_ERR() check to fix the commit f6be98d19985 ("bpf, net: switch to dynamic registration") Fixes: f6be98d19985 ("bpf, net: switch to dynamic registration") Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Link: https://lore.kernel.org/r/69082b9835463fe36f9e354bddf2d0a97df39c2b.1707373307.git.tanggeliang@kylinos.cn Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumaskWaiman Long
Commit 85f0ab43f9de ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of an unbound workqueue to the cpumask in wq->unbound_attrs. However unbound_attrs->cpumask's of all workqueues are initialized to cpu_possible_mask and will only be changed if it has the WQ_SYSFS flag to expose a cpumask sysfs file to be written by users. So this patch doesn't achieve what it is intended to do. If an unbound workqueue is created after wq_unbound_cpumask is modified and there is no more unbound cpumask update after that, the unbound rescuer will be bound to all CPUs unless the workqueue is created with the WQ_SYSFS flag and a user explicitly modified its cpumask sysfs file. Fix this problem by binding directly to wq_unbound_cpumask in init_rescuer(). Fixes: 85f0ab43f9de ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08kernel/workqueue: Let rescuers follow unbound wq cpumask changesJuri Lelli
When workqueue cpumask changes are committed the associated rescuer (if one exists) affinity is not touched and this might be a problem down the line for isolated setups. Make sure rescuers affinity is updated every time a workqueue cpumask changes, so that rescuers can't break isolation. [longman: set_cpus_allowed_ptr() will block until the designated task is enqueued on an allowed CPU, no wake_up_process() needed. Also use the unbound_effective_cpumask() helper as suggested by Tejun.] Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08bpf, btf: Add check_btf_kconfigs helperGeliang Tang
This patch extracts duplicate code on error path when btf_get_module_btf() returns NULL from the functions __register_btf_kfunc_id_set() and register_btf_id_dtor_kfuncs() into a new helper named check_btf_kconfigs() to check CONFIG_DEBUG_INFO_BTF and CONFIG_DEBUG_INFO_BTF_MODULES in it. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/fa5537fc55f1e4d0bfd686598c81b7ab9dbd82b7.1707373307.git.tanggeliang@kylinos.cn Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08workqueue: Enable unbound cpumask update on ordered workqueuesWaiman Long
Ordered workqueues does not currently follow changes made to the global unbound cpumask because per-pool workqueue changes may break the ordering guarantee. IOW, a work function in an ordered workqueue may run on an isolated CPU. This patch enables ordered workqueues to follow changes made to the global unbound cpumask by temporaily plug or suspend the newly allocated pool_workqueue from executing newly queued work items until the old pwq has been properly drained. For ordered workqueues, there should only be one pwq that is unplugged, the rests should be plugged. This enables ordered workqueues to follow the unbound cpumask changes like other unbound workqueues at the expense of some delay in execution of work functions during the transition period. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08workqueue: Link pwq's into wq->pwqs from oldest to newestWaiman Long
Add a new pwq into the tail of wq->pwqs so that pwq iteration will start from the oldest pwq to the newest. This ordering will facilitate the inclusion of ordered workqueues in a wq_unbound_cpumask update. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-02-08bpf, btf: Fix return value of register_btf_id_dtor_kfuncsGeliang Tang
The same as __register_btf_kfunc_id_set(), to let the modules with stripped btf section loaded, this patch changes the return value of register_btf_id_dtor_kfuncs() too from -ENOENT to 0 when btf is NULL. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Link: https://lore.kernel.org/r/eab65586d7fb0e72f2707d3747c7d4a5d60c823f.1707373307.git.tanggeliang@kylinos.cn Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-02-08workqueue: make wq_subsys constRicardo B. Marliere
Now that the driver core can properly handle constant struct bus_type, move the wq_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: "Ricardo B. Marliere" <ricardo@marliere.net> Cc: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240206-bus_cleanup-workqueue-v1-1-72b10d282d58@marliere.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-02-08kprobes: Remove unnecessary initial values of variablesLi zeming
ri and sym is assigned first, so it does not need to initialize the assignment. Link: https://lore.kernel.org/all/20230919012823.7815-1-zeming@nfschina.com/ Signed-off-by: Li zeming <zeming@nfschina.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-02-08tracing/probes: Fix to set arg size and fmt after setting type from BTFMasami Hiramatsu (Google)
Since the BTF type setting updates probe_arg::type, the type size calculation and setting print-fmt should be done after that. Without this fix, the argument size and print-fmt can be wrong. Link: https://lore.kernel.org/all/170602218196.215583.6417859469540955777.stgit@devnote2/ Fixes: b576e09701c7 ("tracing/probes: Support function parameters if BTF is available") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>