summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2023-01-31cgroup/cpuset: Fix wrong check in update_parent_subparts_cpumask()Waiman Long
It was found that the check to see if a partition could use up all the cpus from the parent cpuset in update_parent_subparts_cpumask() was incorrect. As a result, it is possible to leave parent with no effective cpu left even if there are tasks in the parent cpuset. This can lead to system panic as reported in [1]. Fix this probem by updating the check to fail the enabling the partition if parent's effective_cpus is a subset of the child's cpus_allowed. Also record the error code when an error happens in update_prstate() and add a test case where parent partition and child have the same cpu list and parent has task. Enabling partition in the child will fail in this case. [1] https://www.spinics.net/lists/cgroups/msg36254.html Fixes: f0af1bfc27b5 ("cgroup/cpuset: Relax constraints to partition & cpus changes") Cc: stable@vger.kernel.org # v6.1 Reported-by: Srinivas Pandruvada <srinivas.pandruvada@intel.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2023-01-31perf: Fix perf_event_pmu_context serializationJames Clark
Syzkaller triggered a WARN in put_pmu_ctx(). WARNING: CPU: 1 PID: 2245 at kernel/events/core.c:4925 put_pmu_ctx+0x1f0/0x278 This is because there is no locking around the access of "if (!epc->ctx)" in find_get_pmu_context() and when it is set to NULL in put_pmu_ctx(). The decrement of the reference count in put_pmu_ctx() also happens outside of the spinlock, leading to the possibility of this order of events, and the context being cleared in put_pmu_ctx(), after its refcount is non zero: CPU0 CPU1 find_get_pmu_context() if (!epc->ctx) == false put_pmu_ctx() atomic_dec_and_test(&epc->refcount) == true epc->refcount == 0 atomic_inc(&epc->refcount); epc->refcount == 1 list_del_init(&epc->pmu_ctx_entry); epc->ctx = NULL; Another issue is that WARN_ON for no active PMU events in put_pmu_ctx() is outside of the lock. If the perf_event_pmu_context is an embedded one, even after clearing it, it won't be deleted and can be re-used. So the warning can trigger. For this reason it also needs to be moved inside the lock. The above warning is very quick to trigger on Arm by running these two commands at the same time: while true; do perf record -- ls; done while true; do perf record -- ls; done [peterz: atomic_dec_and_raw_lock*()] Fixes: bd2756811766 ("perf: Rewrite core context handling") Reported-by: syzbot+697196bc0265049822bd@syzkaller.appspotmail.com Signed-off-by: James Clark <james.clark@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20230127143141.1782804-2-james.clark@arm.com
2023-01-31sched/clock: Make local_clock() noinstrPeter Zijlstra
With sched_clock() noinstr, provide a noinstr implementation of local_clock(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230126151323.760767043@infradead.org
2023-01-31cpuidle: tracing, preempt: Squash _rcuidle tracingPeter Zijlstra
Extend/fix commit: 9aedeaed6fc6 ("tracing, hardirq: No moar _rcuidle() tracing") ... to also cover trace_preempt_{on,off}() which were mysteriously untouched. Fixes: 9aedeaed6fc6 ("tracing, hardirq: No moar _rcuidle() tracing") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lkml.kernel.org/r/Y9D5AfnOukWNOZ5q@hirez.programming.kicks-ass.net Link: https://lore.kernel.org/r/Y9jWXKgkxY5EZVwW@hirez.programming.kicks-ass.net
2023-01-31cpuidle: lib/bug: Disable rcu_is_watching() during WARN/BUGPeter Zijlstra
In order to avoid WARN/BUG from generating nested or even recursive warnings, force rcu_is_watching() true during WARN/lockdep_rcu_suspicious(). Notably things like unwinding the stack can trigger rcu_dereference() warnings, which then triggers more unwinding which then triggers more warnings etc.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230126151323.408156109@infradead.org
2023-01-31Merge tag 'v6.2-rc6' into sched/core, to pick up fixesIngo Molnar
Pick up fixes before merging another batch of cpuidle updates. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-31hrtimer: Ignore slack time for RT tasks in schedule_hrtimeout_range()Davidlohr Bueso
While in theory the timer can be triggered before expires + delta, for the cases of RT tasks they really have no business giving any lenience for extra slack time, so override any passed value by the user and always use zero for schedule_hrtimeout_range() calls. Furthermore, this is similar to what the nanosleep(2) family already does with current->timer_slack_ns. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230123173206.6764-3-dave@stgolabs.net
2023-01-31hrtimer: Rely on rt_task() for DL tasks tooDavidlohr Bueso
Checking dl_task() is redundant as rt_task() returns true for deadline tasks too. Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230123173206.6764-2-dave@stgolabs.net
2023-01-29Merge tag 'irq_urgent_for_v6.2_rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Borislav Petkov: - Cleanup the firmware node for the new IRQ MSI domain properly, to avoid leaking memory * tag 'irq_urgent_for_v6.2_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq/msi: Free the fwnode created by msi_create_device_irq_domain()
2023-01-28bpf: btf: Add BTF_FMODEL_SIGNED_ARG flagIlya Leoshkevich
s390x eBPF JIT needs to know whether a function return value is signed and which function arguments are signed, in order to generate code compliant with the s390x ABI. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230128000650.1516334-26-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-28bpf: iterators: Split iterators.lskel.h into little- and big- endian versionsIlya Leoshkevich
iterators.lskel.h is little-endian, therefore bpf iterator is currently broken on big-endian systems. Introduce a big-endian version and add instructions regarding its generation. Unfortunately bpftool's cross-endianness capabilities are limited to BTF right now, so the procedure requires access to a big-endian machine or a configured emulator. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20230128000650.1516334-25-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-28bpf: Build-time assert that cpumask offset is zeroDavid Vernet
The first element of a struct bpf_cpumask is a cpumask_t. This is done to allow struct bpf_cpumask to be cast to a struct cpumask. If this element were ever moved to another field, any BPF program passing a struct bpf_cpumask * to a kfunc expecting a const struct cpumask * would immediately fail to load. Add a build-time assertion so this is assumption is captured and verified. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230128141537.100777-1-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-28Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2023-01-28 We've added 124 non-merge commits during the last 22 day(s) which contain a total of 124 files changed, 6386 insertions(+), 1827 deletions(-). The main changes are: 1) Implement XDP hints via kfuncs with initial support for RX hash and timestamp metadata kfuncs, from Stanislav Fomichev and Toke Høiland-Jørgensen. Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk 2) Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case, from Andrii Nakryiko. 3) Significantly reduce the search time for module symbols by livepatch and BPF, from Jiri Olsa and Zhen Lei. 4) Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals, from David Vernet. 5) Fix several issues in the dynptr processing such as stack slot liveness propagation, missing checks for PTR_TO_STACK variable offset, etc, from Kumar Kartikeya Dwivedi. 6) Various performance improvements, fixes, and introduction of more than just one XDP program to XSK selftests, from Magnus Karlsson. 7) Big batch to BPF samples to reduce deprecated functionality, from Daniel T. Lee. 8) Enable struct_ops programs to be sleepable in verifier, from David Vernet. 9) Reduce pr_warn() noise on BTF mismatches when they are expected under the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien. 10) Describe modulo and division by zero behavior of the BPF runtime in BPF's instruction specification document, from Dave Thaler. 11) Several improvements to libbpf API documentation in libbpf.h, from Grant Seltzer. 12) Improve resolve_btfids header dependencies related to subcmd and add proper support for HOSTCC, from Ian Rogers. 13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room() helper along with BPF selftests, from Ziyang Xuan. 14) Simplify the parsing logic of structure parameters for BPF trampoline in the x86-64 JIT compiler, from Pu Lehui. 15) Get BTF working for kernels with CONFIG_RUST enabled by excluding Rust compilation units with pahole, from Martin Rodriguez Reboredo. 16) Get bpf_setsockopt() working for kTLS on top of TCP sockets, from Kui-Feng Lee. 17) Disable stack protection for BPF objects in bpftool given BPF backends don't support it, from Holger Hoffstätte. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits) selftest/bpf: Make crashes more debuggable in test_progs libbpf: Add documentation to map pinning API functions libbpf: Fix malformed documentation formatting selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket. bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt(). bpf/selftests: Verify struct_ops prog sleepable behavior bpf: Pass const struct bpf_prog * to .check_member libbpf: Support sleepable struct_ops.s section bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable selftests/bpf: Fix vmtest static compilation error tools/resolve_btfids: Alter how HOSTCC is forced tools/resolve_btfids: Install subcmd headers bpf/docs: Document the nocast aliasing behavior of ___init bpf/docs: Document how nested trusted fields may be defined bpf/docs: Document cpumask kfuncs in a new file selftests/bpf: Add selftest suite for cpumask kfuncs selftests/bpf: Add nested trust selftests suite bpf: Enable cpumasks to be queried and used as kptrs bpf: Disallow NULLable pointers for trusted kfuncs ... ==================== Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== bpf 2023-01-27 We've added 10 non-merge commits during the last 9 day(s) which contain a total of 10 files changed, 170 insertions(+), 59 deletions(-). The main changes are: 1) Fix preservation of register's parent/live fields when copying range-info, from Eduard Zingerman. 2) Fix an off-by-one bug in bpf_mem_cache_idx() to select the right cache, from Hou Tao. 3) Fix stack overflow from infinite recursion in sock_map_close(), from Jakub Sitnicki. 4) Fix missing btf_put() in register_btf_id_dtor_kfuncs()'s error path, from Jiri Olsa. 5) Fix a splat from bpf_setsockopt() via lsm_cgroup/socket_sock_rcv_skb, from Kui-Feng Lee. 6) Fix bpf_send_signal[_thread]() helpers to hold a reference on the task, from Yonghong Song. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix the kernel crash caused by bpf_setsockopt(). selftests/bpf: Cover listener cloning with progs attached to sockmap selftests/bpf: Pass BPF skeleton to sockmap_listen ops tests bpf, sockmap: Check for any of tcp_bpf_prots when cloning a listener bpf, sockmap: Don't let sock_map_{close,destroy,unhash} call itself bpf: Add missing btf_put to register_btf_id_dtor_kfuncs selftests/bpf: Verify copy_register_state() preserves parent/live fields bpf: Fix to preserve reg parent/live fields when copying range info bpf: Fix a possible task gone issue with bpf_send_signal[_thread]() helpers bpf: Fix off-by-one error in bpf_mem_cache_idx() ==================== Link: https://lore.kernel.org/r/20230127215820.4993-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/intel/ice/ice_main.c 418e53401e47 ("ice: move devlink port creation/deletion") 643ef23bd9dd ("ice: Introduce local var for readability") https://lore.kernel.org/all/20230127124025.0dacef40@canb.auug.org.au/ https://lore.kernel.org/all/20230124005714.3996270-1-anthony.l.nguyen@intel.com/ drivers/net/ethernet/engleder/tsnep_main.c 3d53aaef4332 ("tsnep: Fix TX queue stop/wake for multiple queues") 25faa6a4c5ca ("tsnep: Replace TX spin_lock with __netif_tx_lock") https://lore.kernel.org/all/20230127123604.36bb3e99@canb.auug.org.au/ net/netfilter/nf_conntrack_proto_sctp.c 13bd9b31a969 ("Revert "netfilter: conntrack: add sctp DATA_SENT state"") a44b7651489f ("netfilter: conntrack: unify established states for SCTP paths") f71cb8f45d09 ("netfilter: conntrack: sctp: use nf log infrastructure for invalid packets") https://lore.kernel.org/all/20230127125052.674281f9@canb.auug.org.au/ https://lore.kernel.org/all/d36076f3-6add-a442-6d4b-ead9f7ffff86@tessares.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-27Merge tag 'trace-v6.2-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix filter memory leak by calling ftrace_free_filter() - Initialize trace_printk() earlier so that ftrace_dump_on_oops shows data on early crashes. - Update the outdated instructions in scripts/tracing/ftrace-bisect.sh - Add lockdep_is_held() to fix lockdep warning - Add allocation failure check in create_hist_field() - Don't initialize pointer that gets set right away in enabled_monitors_write() - Update MAINTAINER entries - Fix help messages in Kconfigs - Fix kernel-doc header for update_preds() * tag 'trace-v6.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: bootconfig: Update MAINTAINERS file to add tree and mailing list rv: remove redundant initialization of pointer ptr ftrace: Maintain samples/ftrace tracing/filter: fix kernel-doc warnings lib: Kconfig: fix spellos trace_events_hist: add check for return value of 'create_hist_field' tracing/osnoise: Use built-in RCU list checking tracing: Kconfig: Fix spelling/grammar/punctuation ftrace/scripts: Update the instructions for ftrace-bisect.sh tracing: Make sure trace_printk() can output as soon as it can be used ftrace: Export ftrace_free_filter() to modules
2023-01-26bpf: Fix the kernel crash caused by bpf_setsockopt().Kui-Feng Lee
The kernel crash was caused by a BPF program attached to the "lsm_cgroup/socket_sock_rcv_skb" hook, which performed a call to `bpf_setsockopt()` in order to set the TCP_NODELAY flag as an example. Flags like TCP_NODELAY can prompt the kernel to flush a socket's outgoing queue, and this hook "lsm_cgroup/socket_sock_rcv_skb" is frequently triggered by softirqs. The issue was that in certain circumstances, when `tcp_write_xmit()` was called to flush the queue, it would also allow BH (bottom-half) to run. This could lead to our program attempting to flush the same socket recursively, which caused a `skbuff` to be unlinked twice. `security_sock_rcv_skb()` is triggered by `tcp_filter()`. This occurs before the sock ownership is checked in `tcp_v4_rcv()`. Consequently, if a bpf program runs on `security_sock_rcv_skb()` while under softirq conditions, it may not possess the lock needed for `bpf_setsockopt()`, thus presenting an issue. The patch fixes this issue by ensuring that a BPF program attached to the "lsm_cgroup/socket_sock_rcv_skb" hook is not allowed to call `bpf_setsockopt()`. The differences from v1 are - changing commit log to explain holding the lock of the sock, - emphasizing that TCP_NODELAY is not the only flag, and - adding the fixes tag. v1: https://lore.kernel.org/bpf/20230125000244.1109228-1-kuifeng@meta.com/ Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Fixes: 9113d7e48e91 ("bpf: expose bpf_{g,s}etsockopt to lsm cgroup") Link: https://lore.kernel.org/r/20230127001732.4162630-1-kuifeng@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-26locking/rwsem: Disable preemption in all down_write*() and up_write() code pathsWaiman Long
The previous patch has disabled preemption in all the down_read() and up_read() code paths. For symmetry, this patch extends commit: 48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem lock") ... to have preemption disabled in all the down_write() and up_write() code paths, including downgrade_write(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230126003628.365092-4-longman@redhat.com
2023-01-26locking/rwsem: Disable preemption in all down_read*() and up_read() code pathsWaiman Long
Commit: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner") ... assumes that when the owner field is changed to NULL, the lock will become free soon. But commit: 48dfb5d2560d ("locking/rwsem: Disable preemption while trying for rwsem lock") ... disabled preemption when acquiring rwsem for write. However, preemption has not yet been disabled when acquiring a read lock on a rwsem. So a reader can add a RWSEM_READER_BIAS to count without setting owner to signal a reader, got preempted out by a RT task which then spins in the writer slowpath as owner remains NULL leading to live lock. One easy way to fix this problem is to disable preemption at all the down_read*() and up_read() code paths as implemented in this patch. Fixes: 91d2a812dfb9 ("locking/rwsem: Make handoff writer optimistically spin on owner") Reported-by: Mukesh Ojha <quic_mojha@quicinc.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230126003628.365092-3-longman@redhat.com
2023-01-26locking/rwsem: Prevent non-first waiter from spinning in down_write() slowpathWaiman Long
A non-first waiter can potentially spin in the for loop of rwsem_down_write_slowpath() without sleeping but fail to acquire the lock even if the rwsem is free if the following sequence happens: Non-first RT waiter First waiter Lock holder ------------------- ------------ ----------- Acquire wait_lock rwsem_try_write_lock(): Set handoff bit if RT or wait too long Set waiter->handoff_set Release wait_lock Acquire wait_lock Inherit waiter->handoff_set Release wait_lock Clear owner Release lock if (waiter.handoff_set) { rwsem_spin_on_owner((); if (OWNER_NULL) goto trylock_again; } trylock_again: Acquire wait_lock rwsem_try_write_lock(): if (first->handoff_set && (waiter != first)) return false; Release wait_lock A non-first waiter cannot really acquire the rwsem even if it mistakenly believes that it can spin on OWNER_NULL value. If that waiter happens to be an RT task running on the same CPU as the first waiter, it can block the first waiter from acquiring the rwsem leading to live lock. Fix this problem by making sure that a non-first waiter cannot spin in the slowpath loop without sleeping. Fixes: d257cc8cb8d5 ("locking/rwsem: Make handoff bit handling more consistent") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Mukesh Ojha <quic_mojha@quicinc.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230126003628.365092-2-longman@redhat.com
2023-01-26Merge tag 'v6.2-rc5' into locking/core, to pick up fixesIngo Molnar
Refresh this branch with the latest locking fixes, before applying a new series of changes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-01-25module: Use kstrtobool() instead of strtobool()Christophe JAILLET
strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file (<linux/kstrtox.h>) Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-01-25kernel/params.c: Use kstrtobool() instead of strtobool()Christophe JAILLET
strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file (<linux/kstrtox.h>) Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-01-25bpf: Pass const struct bpf_prog * to .check_memberDavid Vernet
The .check_member field of struct bpf_struct_ops is currently passed the member's btf_type via const struct btf_type *t, and a const struct btf_member *member. This allows the struct_ops implementation to check whether e.g. an ops is supported, but it would be useful to also enforce that the struct_ops prog being loaded for that member has other qualities, like being sleepable (or not). This patch therefore updates the .check_member() callback to also take a const struct bpf_prog *prog argument. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230125164735.785732-4-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepableDavid Vernet
BPF struct_ops programs currently cannot be marked as sleepable. This need not be the case -- struct_ops programs can be sleepable, and e.g. invoke kfuncs that export the KF_SLEEPABLE flag. So as to allow future struct_ops programs to invoke such kfuncs, this patch updates the verifier to allow struct_ops programs to be sleepable. A follow-on patch will add support to libbpf for specifying struct_ops.s as a sleepable struct_ops program, and then another patch will add testcases to the dummy_st_ops selftest suite which test sleepable struct_ops behavior. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230125164735.785732-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25bpf/docs: Document cpumask kfuncs in a new fileDavid Vernet
Now that we've added a series of new cpumask kfuncs, we should document them so users can easily use them. This patch adds a new cpumasks.rst file to document them. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230125143816.721952-6-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25bpf: Enable cpumasks to be queried and used as kptrsDavid Vernet
Certain programs may wish to be able to query cpumasks. For example, if a program that is tracing percpu operations wishes to track which tasks end up running on which CPUs, it could be useful to associate that with the tasks' cpumasks. Similarly, programs tracking NUMA allocations, CPU scheduling domains, etc, could potentially benefit from being able to see which CPUs a task could be migrated to. This patch enables these types of use cases by introducing a series of bpf_cpumask_* kfuncs. Amongst these kfuncs, there are two separate "classes" of operations: 1. kfuncs which allow the caller to allocate and mutate their own cpumask kptrs in the form of a struct bpf_cpumask * object. Such kfuncs include e.g. bpf_cpumask_create() to allocate the cpumask, and bpf_cpumask_or() to mutate it. "Regular" cpumasks such as p->cpus_ptr may not be passed to these kfuncs, and the verifier will ensure this is the case by comparing BTF IDs. 2. Read-only operations which operate on const struct cpumask * arguments. For example, bpf_cpumask_test_cpu(), which tests whether a CPU is set in the cpumask. Any trusted struct cpumask * or struct bpf_cpumask * may be passed to these kfuncs. The verifier allows struct bpf_cpumask * even though the kfunc is defined with struct cpumask * because the first element of a struct bpf_cpumask is a cpumask_t, so it is safe to cast. A follow-on patch will add selftests which validate these kfuncs, and another will document them. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230125143816.721952-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25bpf: Disallow NULLable pointers for trusted kfuncsDavid Vernet
KF_TRUSTED_ARGS kfuncs currently have a subtle and insidious bug in validating pointers to scalars. Say that you have a kfunc like the following, which takes an array as the first argument: bool bpf_cpumask_empty(const struct cpumask *cpumask) { return cpumask_empty(cpumask); } ... BTF_ID_FLAGS(func, bpf_cpumask_empty, KF_TRUSTED_ARGS) ... If a BPF program were to invoke the kfunc with a NULL argument, it would crash the kernel. The reason is that struct cpumask is defined as a bitmap, which is itself defined as an array, and is accessed as a memory address by bitmap operations. So when the verifier analyzes the register, it interprets it as a pointer to a scalar struct, which is an array of size 8. check_mem_reg() then sees that the register is NULL and returns 0, and the kfunc crashes when it passes it down to the cpumask wrappers. To fix this, this patch adds a check for KF_ARG_PTR_TO_MEM which verifies that the register doesn't contain a possibly-NULL pointer if the kfunc is KF_TRUSTED_ARGS. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230125143816.721952-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-25tracing/histogram: Add simple tests for stacktrace usage of synthetic eventsSteven Rostedt (Google)
Update the selftests to include a test of passing a stacktrace between the events of a synthetic event. Link: https://lkml.kernel.org/r/20230117152236.475439286@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25tracing/histogram: Add stacktrace typeSteven Rostedt (Google)
Now that stacktraces can be part of synthetic events, allow a key to be typed as a stacktrace. # cd /sys/kernel/tracing # echo 's:block_lat u64 delta; unsigned long stack[];' >> dynamic_events # echo 'hist:keys=next_pid:ts=common_timestamp.usecs,st=stacktrace if prev_state == 2' >> events/sched/sched_switch/trigger # echo 'hist:keys=prev_pid:delta=common_timestamp.usecs-$ts,st2=$st:onmatch(sched.sched_switch).trace(block_lat,$delta,$st2)' >> events/sched/sched_switch/trigger # echo 'hist:keys=delta.buckets=100,stack.stacktrace:sort=delta' > events/synthetic/block_lat/trigger # cat events/synthetic/block_lat/hist # event histogram # # trigger info: hist:keys=delta.buckets=100,stacktrace:vals=hitcount:sort=delta.buckets=100:size=2048 [active] # { delta: ~ 0-99, stacktrace: event_hist_trigger+0x464/0x480 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x250 trace_event_raw_event_sched_switch+0xfc/0x150 __traceiter_sched_switch+0x41/0x60 __schedule+0x448/0x7b0 schedule_idle+0x26/0x40 cpu_startup_entry+0x19/0x20 start_secondary+0xed/0xf0 secondary_startup_64_no_verify+0xe0/0xeb } hitcount: 6 { delta: ~ 0-99, stacktrace: event_hist_trigger+0x464/0x480 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x250 trace_event_raw_event_sched_switch+0xfc/0x150 __traceiter_sched_switch+0x41/0x60 __schedule+0x448/0x7b0 schedule_idle+0x26/0x40 cpu_startup_entry+0x19/0x20 __pfx_kernel_init+0x0/0x10 arch_call_rest_init+0xa/0x24 start_kernel+0x964/0x98d secondary_startup_64_no_verify+0xe0/0xeb } hitcount: 3 { delta: ~ 0-99, stacktrace: event_hist_trigger+0x464/0x480 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x250 trace_event_raw_event_sched_switch+0xfc/0x150 __traceiter_sched_switch+0x41/0x60 __schedule+0x448/0x7b0 schedule+0x5a/0xb0 worker_thread+0xaf/0x380 kthread+0xe9/0x110 ret_from_fork+0x2c/0x50 } hitcount: 1 { delta: ~ 100-199, stacktrace: event_hist_trigger+0x464/0x480 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x250 trace_event_raw_event_sched_switch+0xfc/0x150 __traceiter_sched_switch+0x41/0x60 __schedule+0x448/0x7b0 schedule_idle+0x26/0x40 cpu_startup_entry+0x19/0x20 start_secondary+0xed/0xf0 secondary_startup_64_no_verify+0xe0/0xeb } hitcount: 15 [..] { delta: ~ 8500-8599, stacktrace: event_hist_trigger+0x464/0x480 event_triggers_call+0x52/0xe0 trace_event_buffer_commit+0x193/0x250 trace_event_raw_event_sched_switch+0xfc/0x150 __traceiter_sched_switch+0x41/0x60 __schedule+0x448/0x7b0 schedule_idle+0x26/0x40 cpu_startup_entry+0x19/0x20 start_secondary+0xed/0xf0 secondary_startup_64_no_verify+0xe0/0xeb } hitcount: 1 Totals: Hits: 89 Entries: 11 Dropped: 0 Link: https://lkml.kernel.org/r/20230117152236.167046397@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25tracing: Allow synthetic events to pass around stacktracesSteven Rostedt (Google)
Allow a stacktrace from one event to be displayed by the end event of a synthetic event. This is very useful when looking for the longest latency of a sleep or something blocked on I/O. # cd /sys/kernel/tracing/ # echo 's:block_lat pid_t pid; u64 delta; unsigned long[] stack;' > dynamic_events # echo 'hist:keys=next_pid:ts=common_timestamp.usecs,st=stacktrace if prev_state == 1||prev_state == 2' > events/sched/sched_switch/trigger # echo 'hist:keys=prev_pid:delta=common_timestamp.usecs-$ts,s=$st:onmax($delta).trace(block_lat,prev_pid,$delta,$s)' >> events/sched/sched_switch/trigger The above creates a "block_lat" synthetic event that take the stacktrace of when a task schedules out in either the interruptible or uninterruptible states, and on a new per process max $delta (the time it was scheduled out), will print the process id and the stacktrace. # echo 1 > events/synthetic/block_lat/enable # cat trace # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | kworker/u16:0-767 [006] d..4. 560.645045: block_lat: pid=767 delta=66 stack=STACK: => __schedule => schedule => pipe_read => vfs_read => ksys_read => do_syscall_64 => 0x966000aa <idle>-0 [003] d..4. 561.132117: block_lat: pid=0 delta=413787 stack=STACK: => __schedule => schedule => schedule_hrtimeout_range_clock => do_sys_poll => __x64_sys_poll => do_syscall_64 => 0x966000aa <...>-153 [006] d..4. 562.068407: block_lat: pid=153 delta=54 stack=STACK: => __schedule => schedule => io_schedule => rq_qos_wait => wbt_wait => __rq_qos_throttle => blk_mq_submit_bio => submit_bio_noacct_nocheck => ext4_bio_write_page => mpage_submit_page => mpage_process_page_bufs => mpage_prepare_extent_to_map => ext4_do_writepages => ext4_writepages => do_writepages => __writeback_single_inode Link: https://lkml.kernel.org/r/20230117152236.010941267@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25tracing: Allow stacktraces to be saved as histogram variablesSteven Rostedt (Google)
Allow to save stacktraces into a histogram variable. This will be used by synthetic events to allow a stacktrace from one event to be passed and displayed by another event. The special keyword "stacktrace" is to be used to trigger a stack trace for the event that the histogram trigger is attached to. echo 'hist:keys=pid:st=stacktrace" > events/sched/sched_waking/trigger Currently nothing can get access to the "$st" variable above that contains the stack trace, but that will soon change. Link: https://lkml.kernel.org/r/20230117152235.856323729@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25tracing: Simplify calculating entry size using struct_size()Steven Rostedt (Google)
When tracing a dynamic string field for a synthetic event, the offset calculation for where to write the next event can use struct_size() to find what the current size of the structure is. This simplifies the code and makes it less error prone. Link: https://lkml.kernel.org/r/20230117152235.698632147@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Ching-lin Yu <chinglinyu@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25tracing: Add NULL checks for buffer in ring_buffer_free_read_page()Jia-Ju Bai
In a previous commit 7433632c9ff6, buffer, buffer->buffers and buffer->buffers[cpu] in ring_buffer_wake_waiters() can be NULL, and thus the related checks are added. However, in the same call stack, these variables are also used in ring_buffer_free_read_page(): tracing_buffers_release() ring_buffer_wake_waiters(iter->array_buffer->buffer) cpu_buffer = buffer->buffers[cpu] -> Add checks by previous commit ring_buffer_free_read_page(iter->array_buffer->buffer) cpu_buffer = buffer->buffers[cpu] -> No check Thus, to avod possible null-pointer derefernces, the related checks should be added. These results are reported by a static tool designed by myself. Link: https://lkml.kernel.org/r/20230113125501.760324-1-baijiaju1990@gmail.com Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25tracing: Add a way to filter function addresses to function namesSteven Rostedt (Google)
There's been several times where an event records a function address in its field and I needed to filter on that address for a specific function name. It required looking up the function in kallsyms, finding its size, and doing a compare of "field >= function_start && field < function_end". But this would change from boot to boot and is unreliable in scripts. Also, it is useful to have this at boot up, where the addresses will not be known. For example, on the boot command line: trace_trigger="initcall_finish.traceoff if func.function == acpi_init" To implement this, add a ".function" prefix, that will check that the field is of size long, and the only operations allowed (so far) are "==" and "!=". Link: https://lkml.kernel.org/r/20221219183213.916833763@goodmis.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Zheng Yejian <zhengyejian1@huawei.com> Reviewed-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25rv: remove redundant initialization of pointer ptrColin Ian King
The pointer ptr is being initialized with a value that is never read, it is being updated later on a call to strim. Remove the extraneous initialization. Link: https://lkml.kernel.org/r/20230116161612.77192-1-colin.i.king@gmail.com Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-25tracing/filter: fix kernel-doc warningsRandy Dunlap
Use the 'struct' keyword for a struct's kernel-doc notation and use the correct function parameter name to eliminate kernel-doc warnings: kernel/trace/trace_events_filter.c:136: warning: cannot understand function prototype: 'struct prog_entry ' kerne/trace/trace_events_filter.c:155: warning: Excess function parameter 'when_to_branch' description in 'update_preds' Also correct some trivial punctuation problems. Link: https://lkml.kernel.org/r/20230108021238.16398-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-24bpf: Allow trusted args to walk struct when checking BTF IDsDavid Vernet
When validating BTF types for KF_TRUSTED_ARGS kfuncs, the verifier currently enforces that the top-level type must match when calling the kfunc. In other words, the verifier does not allow the BPF program to pass a bitwise equivalent struct, despite it being allowed according to the C standard. For example, if you have the following type: struct nf_conn___init { struct nf_conn ct; }; The C standard stipulates that it would be safe to pass a struct nf_conn___init to a kfunc expecting a struct nf_conn. The verifier currently disallows this, however, as semantically kfuncs may want to enforce that structs that have equivalent types according to the C standard, but have different BTF IDs, are not able to be passed to kfuncs expecting one or the other. For example, struct nf_conn___init may not be queried / looked up, as it is allocated but may not yet be fully initialized. On the other hand, being able to pass types that are equivalent according to the C standard will be useful for other types of kfunc / kptrs enabled by BPF. For example, in a follow-on patch, a series of kfuncs will be added which allow programs to do bitwise queries on cpumasks that are either allocated by the program (in which case they'll be a 'struct bpf_cpumask' type that wraps a cpumask_t as its first element), or a cpumask that was allocated by the main kernel (in which case it will just be a straight cpumask_t, as in task->cpus_ptr). Having the two types of cpumasks allows us to distinguish between the two for when a cpumask is read-only vs. mutatable. A struct bpf_cpumask can be mutated by e.g. bpf_cpumask_clear(), whereas a regular cpumask_t cannot be. On the other hand, a struct bpf_cpumask can of course be queried in the exact same manner as a cpumask_t, with e.g. bpf_cpumask_test_cpu(). If we were to enforce that top level types match, then a user that's passing a struct bpf_cpumask to a read-only cpumask_t argument would have to cast with something like bpf_cast_to_kern_ctx() (which itself would need to be updated to expect the alias, and currently it only accommodates a single alias per prog type). Additionally, not specifying KF_TRUSTED_ARGS is not an option, as some kfuncs take one argument as a struct bpf_cpumask *, and another as a struct cpumask * (i.e. cpumask_t). In order to enable this, this patch relaxes the constraint that a KF_TRUSTED_ARGS kfunc must have strict type matching, and instead only enforces strict type matching if a type is observed to be a "no-cast alias" (i.e., that the type names are equivalent, but one is suffixed with ___init). Additionally, in order to try and be conservative and match existing behavior / expectations, this patch also enforces strict type checking for acquire kfuncs. We were already enforcing it for release kfuncs, so this should also improve the consistency of the semantics for kfuncs. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230120192523.3650503-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-24bpf: Enable annotating trusted nested pointersDavid Vernet
In kfuncs, a "trusted" pointer is a pointer that the kfunc can assume is safe, and which the verifier will allow to be passed to a KF_TRUSTED_ARGS kfunc. Currently, a KF_TRUSTED_ARGS kfunc disallows any pointer to be passed at a nonzero offset, but sometimes this is in fact safe if the "nested" pointer's lifetime is inherited from its parent. For example, the const cpumask_t *cpus_ptr field in a struct task_struct will remain valid until the task itself is destroyed, and thus would also be safe to pass to a KF_TRUSTED_ARGS kfunc. While it would be conceptually simple to enable this by using BTF tags, gcc unfortunately does not yet support this. In the interim, this patch enables support for this by using a type-naming convention. A new BTF_TYPE_SAFE_NESTED macro is defined in verifier.c which allows a developer to specify the nested fields of a type which are considered trusted if its parent is also trusted. The verifier is also updated to account for this. A patch with selftests will be added in a follow-on change, along with documentation for this feature. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230120192523.3650503-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-24trace_events_hist: add check for return value of 'create_hist_field'Natalia Petrova
Function 'create_hist_field' is called recursively at trace_events_hist.c:1954 and can return NULL-value that's why we have to check it to avoid null pointer dereference. Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lkml.kernel.org/r/20230111120409.4111-1-n.petrova@fintech.ru Cc: stable@vger.kernel.org Fixes: 30350d65ac56 ("tracing: Add variable support to hist triggers") Signed-off-by: Natalia Petrova <n.petrova@fintech.ru> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-24clocksource: Suspend the watchdog temporarily when high read latency detectedFeng Tang
Bugs have been reported on 8 sockets x86 machines in which the TSC was wrongly disabled when the system is under heavy workload. [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped! [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped! [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998) [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564. [ 821.067990] clocksource: Switched to clocksource hpet This can be reproduced by running memory intensive 'stream' tests, or some of the stress-ng subcases such as 'ioport'. The reason for these issues is the when system is under heavy load, the read latency of the clocksources can be very high. Even lightweight TSC reads can show high latencies, and latencies are much worse for external clocksources such as HPET or the APIC PM timer. These latencies can result in false-positive clocksource-unstable determinations. These issues were initially reported by a customer running on a production system, and this problem was reproduced on several generations of Xeon servers, especially when running the stress-ng test. These Xeon servers were not production systems, but they did have the latest steppings and firmware. Given that the clocksource watchdog is a continual diagnostic check with frequency of twice a second, there is no need to rush it when the system is under heavy load. Therefore, when high clocksource read latencies are detected, suspend the watchdog timer for 5 minutes. Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Waiman Long <longman@redhat.com> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-24tracing/osnoise: Use built-in RCU list checkingChuang Wang
list_for_each_entry_rcu() has built-in RCU and lock checking. Pass cond argument to list_for_each_entry_rcu() to silence false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled. Execute as follow: [tracing]# echo osnoise > current_tracer [tracing]# echo 1 > tracing_on [tracing]# echo 0 > tracing_on The trace_types_lock is held when osnoise_tracer_stop() or timerlat_tracer_stop() are called in the non-RCU read side section. So, pass lockdep_is_held(&trace_types_lock) to silence false lockdep warning. Link: https://lkml.kernel.org/r/20221227023036.784337-1-nashuiliang@gmail.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: dae181349f1e ("tracing/osnoise: Support a list of trace_array *tr") Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-24module: Don't wait for GOING modulesPetr Pavlu
During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error. Since commit 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel waits for modules in MODULE_STATE_GOING state to finish unloading before making another attempt to load the same module. This creates unnecessary work in the described scenario and delays the boot. In the worst case, it can prevent udev from loading drivers for other devices and might cause timeouts of services waiting on them and subsequently a failed boot. This patch attempts a different solution for the problem 6e6de3dee51a was trying to solve. Rather than waiting for the unloading to complete, it returns a different error code (-EBUSY) for modules in the GOING state. This should avoid the error situation that was described in 6e6de3dee51a (user space attempting to load a dependent module because the -EEXIST error code would suggest to user space that the first module had been loaded successfully), while avoiding the delay situation too. This has been tested on linux-next since December 2022 and passes all kmod selftests except test 0009 with module compression enabled but it has been confirmed that this issue has existed and has gone unnoticed since prior to this commit and can also be reproduced without module compression with a simple usleep(5000000) on tools/modprobe.c [0]. These failures are caused by hitting the kernel mod_concurrent_max and can happen either due to a self inflicted kernel module auto-loead DoS somehow or on a system with large CPU count and each CPU count incorrectly triggering many module auto-loads. Both of those issues need to be fixed in-kernel. [0] https://lore.kernel.org/all/Y9A4fiobL6IHp%2F%2FP@bombadil.infradead.org/ Fixes: 6e6de3dee51a ("kernel/module.c: Only return -EEXIST for modules that have finished loading") Co-developed-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Cc: stable@vger.kernel.org Reviewed-by: Petr Mladek <pmladek@suse.com> [mcgrof: enhance commit log with testing and kmod test result interpretation ] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-01-24tracing: Kconfig: Fix spelling/grammar/punctuationRandy Dunlap
Fix some editorial nits in trace Kconfig. Link: https://lkml.kernel.org/r/20230124181647.15902-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-24tracing: Make sure trace_printk() can output as soon as it can be usedSteven Rostedt (Google)
Currently trace_printk() can be used as soon as early_trace_init() is called from start_kernel(). But if a crash happens, and "ftrace_dump_on_oops" is set on the kernel command line, all you get will be: [ 0.456075] <idle>-0 0dN.2. 347519us : Unknown type 6 [ 0.456075] <idle>-0 0dN.2. 353141us : Unknown type 6 [ 0.456075] <idle>-0 0dN.2. 358684us : Unknown type 6 This is because the trace_printk() event (type 6) hasn't been registered yet. That gets done via an early_initcall(), which may be early, but not early enough. Instead of registering the trace_printk() event (and other ftrace events, which are not trace events) via an early_initcall(), have them registered at the same time that trace_printk() can be used. This way, if there is a crash before early_initcall(), then the trace_printk()s will actually be useful. Link: https://lkml.kernel.org/r/20230104161412.019f6c55@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: e725c731e3bb1 ("tracing: Split tracing initialization into two for early initialization") Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-24ftrace: Export ftrace_free_filter() to modulesMark Rutland
Setting filters on an ftrace ops results in some memory being allocated for the filter hashes, which must be freed before the ops can be freed. This can be done by removing every individual element of the hash by calling ftrace_set_filter_ip() or ftrace_set_filter_ips() with `remove` set, but this is somewhat error prone as it's easy to forget to remove an element. Make it easier to clean this up by exporting ftrace_free_filter(), which can be used to clean up all of the filter hashes after an ftrace_ops has been unregistered. Using this, fix the ftrace-direct* samples to free hashes prior to being unloaded. All other code either removes individual filters explicitly or is built-in and already calls ftrace_free_filter(). Link: https://lkml.kernel.org/r/20230103124912.2948963-3-mark.rutland@arm.com Cc: stable@vger.kernel.org Cc: Florent Revest <revest@chromium.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: e1067a07cfbc ("ftrace/samples: Add module to test multi direct modify interface") Fixes: 5fae941b9a6f ("ftrace/samples: Add multi direct interface test module") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-01-24Compiler attributes: GCC cold function alignment workaroundsMark Rutland
Contemporary versions of GCC (e.g. GCC 12.2.0) drop the alignment specified by '-falign-functions=N' for functions marked with the __cold__ attribute, and potentially for callees of __cold__ functions as these may be implicitly marked as __cold__ by the compiler. LLVM appears to respect '-falign-functions=N' in such cases. This has been reported to GCC in bug 88345: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345 ... which also covers alignment being dropped when '-Os' is used, which will be addressed in a separate patch. Currently, use of '-falign-functions=N' is limited to CONFIG_FUNCTION_ALIGNMENT, which is largely used for performance and/or analysis reasons (e.g. with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B), but isn't necessary for correct functionality. However, this dropped alignment isn't great for the performance and/or analysis cases. Subsequent patches will use CONFIG_FUNCTION_ALIGNMENT as part of arm64's ftrace implementation, which will require all instrumented functions to be aligned to at least 8-bytes. This patch works around the dropped alignment by avoiding the use of the __cold__ attribute when CONFIG_FUNCTION_ALIGNMENT is non-zero, and by specifically aligning abort(), which GCC implicitly marks as __cold__. As the __cold macro is now dependent upon config options (which is against the policy described at the top of compiler_attributes.h), it is moved into compiler_types.h. I've tested this by building and booting a kernel configured with defconfig + CONFIG_EXPERT=y + CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B=y, and looking for misaligned text symbols in /proc/kallsyms: * arm64: Before: # uname -rm 6.2.0-rc3 aarch64 # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l 5009 After: # uname -rm 6.2.0-rc3-00001-g2a2bedf8bfa9 aarch64 # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l 919 * x86_64: Before: # uname -rm 6.2.0-rc3 x86_64 # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l 11537 After: # uname -rm 6.2.0-rc3-00001-g2a2bedf8bfa9 x86_64 # grep ' [Tt] ' /proc/kallsyms | grep -iv '[048c]0 [Tt] ' | wc -l 2805 There's clearly a substantial reduction in the number of misaligned symbols. From manual inspection, the remaining unaligned text labels are a combination of ACPICA functions (due to the use of '-Os'), static call trampolines, and non-function labels in assembly, which will be dealt with in subsequent patches. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Florent Revest <revest@chromium.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Will Deacon <will@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20230123134603.1064407-3-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-01-24ftrace: Add DYNAMIC_FTRACE_WITH_CALL_OPSMark Rutland
Architectures without dynamic ftrace trampolines incur an overhead when multiple ftrace_ops are enabled with distinct filters. in these cases, each call site calls a common trampoline which uses ftrace_ops_list_func() to iterate over all enabled ftrace functions, and so incurs an overhead relative to the size of this list (including RCU protection overhead). Architectures with dynamic ftrace trampolines avoid this overhead for call sites which have a single associated ftrace_ops. In these cases, the dynamic trampoline is customized to branch directly to the relevant ftrace function, avoiding the list overhead. On some architectures it's impractical and/or undesirable to implement dynamic ftrace trampolines. For example, arm64 has limited branch ranges and cannot always directly branch from a call site to an arbitrary address (e.g. from a kernel text address to an arbitrary module address). Calls from modules to core kernel text can be indirected via PLTs (allocated at module load time) to address this, but the same is not possible from calls from core kernel text. Using an indirect branch from a call site to an arbitrary trampoline is possible, but requires several more instructions in the function prologue (or immediately before it), and/or comes with far more complex requirements for patching. Instead, this patch adds a new option, where an architecture can associate each call site with a pointer to an ftrace_ops, placed at a fixed offset from the call site. A shared trampoline can recover this pointer and call ftrace_ops::func() without needing to go via ftrace_ops_list_func(), avoiding the associated overhead. This avoids issues with branch range limitations, and avoids the need to allocate and manipulate dynamic trampolines, making it far simpler to implement and maintain, while having similar performance characteristics. Note that this allows for dynamic ftrace_ops to be invoked directly from an architecture's ftrace_caller trampoline, whereas existing code forces the use of ftrace_ops_get_list_func(), which is in part necessary to permit the ftrace_ops to be freed once unregistered *and* to avoid branch/address-generation range limitation on some architectures (e.g. where ops->func is a module address, and may be outside of the direct branch range for callsites within the main kernel image). The CALL_OPS approach avoids this problems and is safe as: * The existing synchronization in ftrace_shutdown() using ftrace_shutdown() using synchronize_rcu_tasks_rude() (and synchronize_rcu_tasks()) ensures that no tasks hold a stale reference to an ftrace_ops (e.g. in the middle of the ftrace_caller trampoline, or while invoking ftrace_ops::func), when that ftrace_ops is unregistered. Arguably this could also be relied upon for the existing scheme, permitting dynamic ftrace_ops to be invoked directly when ops->func is in range, but this will require additional logic to handle branch range limitations, and is not handled by this patch. * Each callsite's ftrace_ops pointer literal can hold any valid kernel address, and is updated atomically. As an architecture's ftrace_caller trampoline will atomically load the ops pointer then dereference ops->func, there is no risk of invoking ops->func with a mismatches ops pointer, and updates to the ops pointer do not require special care. A subsequent patch will implement architectures support for arm64. There should be no functional change as a result of this patch alone. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Florent Revest <revest@chromium.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20230123134603.1064407-2-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2023-01-23rcu: Disable laziness if lazy-tracking says soJoel Fernandes (Google)
During suspend, we see failures to suspend 1 in 300-500 suspends. Looking closer, it appears that asynchronous RCU callbacks are being queued as lazy even though synchronous callbacks are expedited. These delays appear to not be very welcome by the suspend/resume code as evidenced by these occasional suspend failures. This commit modifies call_rcu() to check if rcu_async_should_hurry(), which will return true if we are in suspend or in-kernel boot. [ paulmck: Alphabetize local variables. ] Ignoring the lazy hint makes the 3000 suspend/resume cycles pass reliably on a 12th gen 12-core Intel CPU, and there is some evidence that it also slightly speeds up boot performance. Fixes: 3cb278e73be5 ("rcu: Make call_rcu() lazy to save power") Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2023-01-23bpf: Support consuming XDP HW metadata from fext programsToke Høiland-Jørgensen
Instead of rejecting the attaching of PROG_TYPE_EXT programs to XDP programs that consume HW metadata, implement support for propagating the offload information. The extension program doesn't need to set a flag or ifindex, these will just be propagated from the target by the verifier. We need to create a separate offload object for the extension program, though, since it can be reattached to a different program later (which means we can't just inherit the offload information from the target). An additional check is added on attach that the new target is compatible with the offload information in the extension prog. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-9-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>