path: root/kernel
Age  Commit message  Author
2022-05-02  blk-cgroup: replace bio_blkcg with bio_blkcg_css  (Christoph Hellwig)
All callers of bio_blkcg actually want the CSS, so replace it with an interface that returns the CSS directly. This also allows moving struct blkcg_gq to block/blk-cgroup.h instead of exposing it in a public header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
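A minimal sketch of what the caller-side change looks like, assuming bio_blkcg_css() takes the bio and returns the css pointer directly (the "before" form is reconstructed, not quoted from the patch):

	/* Hedged sketch -- before, callers reached the CSS via the blkcg: */
	struct cgroup_subsys_state *css = &bio_blkcg(bio)->css;

	/* after, the CSS is returned directly and blkcg_gq can stay
	 * private to block/blk-cgroup.h: */
	struct cgroup_subsys_state *css = bio_blkcg_css(bio);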
2022-05-02  blktrace: cleanup the __trace_note_message interface  (Christoph Hellwig)
Pass the cgroup_subsys_state instead of the blkg so that blktrace doesn't need to poke into blk-cgroup internals, and give the name a blk prefix, as the current name is way too generic for a public interface.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-02  time/sched_clock: Fix formatting of frequency reporting code  (Maciej W. Rozycki)
Use flat rather than nested indentation for chained else/if clauses as per coding-style.rst:

	if (x == y) {
		..
	} else if (x > y) {
		...
	} else {
		....
	}

This also improves readability.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2204240148220.9383@angie.orcam.me.uk
2022-05-02  time/sched_clock: Use Hz as the unit for clock rate reporting below 4kHz  (Maciej W. Rozycki)
The kernel uses kHz as the unit for clock rates reported between 1MHz (inclusive) and 4MHz (exclusive), e.g.:

	sched_clock: 64 bits at 1000kHz, resolution 1000ns, wraps every 2199023255500ns

This reduces the amount of data lost due to rounding, but hasn't been replicated for the kHz range when support was added for proper reporting of sub-kHz clock rates. Take the same approach for rates between 1kHz (inclusive) and 4kHz (exclusive), which makes it consistent.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2204240106380.9383@angie.orcam.me.uk
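The reporting logic this implies is roughly the following (illustrative sketch with made-up variable names, using the 4x thresholds from the changelog, not the literal patch):

	/* Sketch: pick the largest unit that still keeps some sub-unit
	 * precision; rates below 4x a unit drop to the next smaller one. */
	if (rate >= 4000000) {		/* >= 4 MHz: report in MHz */
		scaled = rate / 1000000;
		unit = "MHz";
	} else if (rate >= 4000) {	/* 4 kHz .. 4 MHz: report in kHz */
		scaled = rate / 1000;
		unit = "kHz";
	} else {			/* below 4 kHz: report plain Hz */
		scaled = rate;
		unit = "Hz";
	}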
2022-05-02  time/sched_clock: Round the frequency reported to nearest rather than down  (Maciej W. Rozycki)
The frequencies reported for clock sources are rounded down, which gives misleading figures, e.g.:

	I/O ASIC clock frequency 24999480Hz
	sched_clock: 32 bits at 24MHz, resolution 40ns, wraps every 85901132779ns
	MIPS counter frequency 59998512Hz
	sched_clock: 32 bits at 59MHz, resolution 16ns, wraps every 35792281591ns

Rounding to nearest is more adequate:

	I/O ASIC clock frequency 24999664Hz
	sched_clock: 32 bits at 25MHz, resolution 40ns, wraps every 85900499947ns
	MIPS counter frequency 59999728Hz
	sched_clock: 32 bits at 60MHz, resolution 16ns, wraps every 35791556599ns

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2204240055590.9383@angie.orcam.me.uk
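In kernel terms this amounts to swapping truncating integer division for DIV_ROUND_CLOSEST() when scaling the rate (sketch of the idea, not the literal diff):

	/* before: truncates, 24999480 Hz -> "24MHz" */
	scaled = rate / 1000000;

	/* after: rounds to nearest, 24999480 Hz -> "25MHz" */
	scaled = DIV_ROUND_CLOSEST(rate, 1000000);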
2022-05-02  genirq: Use pm_runtime_resume_and_get() instead of pm_runtime_get_sync()  (Minghao Chi)
pm_runtime_resume_and_get() achieves the same and simplifies the code.

[ tglx: Simplify it further by presetting retval ]

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220418110716.2559453-1-chi.minghao@zte.com.cn
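The simplification in the common error-handling pattern (sketch):

	/* pm_runtime_get_sync() leaves the usage count raised even on
	 * failure, so callers had to drop it themselves: */
	retval = pm_runtime_get_sync(dev);
	if (retval < 0) {
		pm_runtime_put_noidle(dev);
		return retval;
	}

	/* pm_runtime_resume_and_get() drops the count on failure: */
	retval = pm_runtime_resume_and_get(dev);
	if (retval < 0)
		return retval;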
2022-05-02  timekeeping: Consolidate fast timekeeper  (Thomas Gleixner)
Provide an inline function which replaces the copy & pasted code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220415091921.072296632@linutronix.de
2022-05-02  timekeeping: Annotate ktime_get_boot_fast_ns() with data_race()  (Thomas Gleixner)
Accessing timekeeper::offset_boot in ktime_get_boot_fast_ns() is an intended data race, as the reader side cannot synchronize with a writer and there is no space in struct tk_read_base of the NMI safe timekeeper. Mark it so.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220415091920.956045162@linutronix.de
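Roughly, the annotation looks like this (sketch based on the changelog; the exact expression in the patch may differ):

	u64 notrace ktime_get_boot_fast_ns(void)
	{
		struct timekeeper *tk = &tk_core.timekeeper;

		/* Intentionally racy read: the NMI safe reader cannot
		 * synchronize with a writer, so tell KCSAN it is fine. */
		return (ktime_get_mono_fast_ns() +
			data_race(ktime_to_ns(tk->offs_boot)));
	}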
2022-05-01  mm: Fix PASID use-after-free issue  (Fenghua Yu)
The PASID is being freed too early. It needs to stay around until after device drivers that might be using it have had a chance to clear it out of the hardware.

The relevant refcounts are:

	mmget()/mmput()    refcount the mm's address space
	mmgrab()/mmdrop()  refcount the mm itself

The PASID is currently tied to the life of the mm's address space and freed in __mmput(). This makes logical sense because the PASID can't be used once the address space is gone. But this misses an important point: even after the address space is gone, the PASID will still be programmed into a device. Device drivers might, for instance, still need to flush operations that are outstanding and need to use that PASID. They do this at file->release() time.

Device drivers call the IOMMU driver to hold a reference on the mm itself and drop it at file->release() time. But the IOMMU driver holds a reference on the mm itself, not the address space. The address space (and the PASID) is long gone by the time the driver tries to clean up. This is effectively a use-after-free bug on the PASID.

To fix this, move the PASID free operation from __mmput() to __mmdrop(). This ensures that the IOMMU driver's existing mmgrab() keeps the PASID allocated until it drops its mm reference.

Fixes: 701fac40384f ("iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit")
Reported-by: Zhangfei Gao <zhangfei.gao@foxmail.com>
Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Suggested-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Zhangfei Gao <zhangfei.gao@foxmail.com>
Reviewed-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Link: https://lore.kernel.org/r/20220428180041.806809-1-fenghua.yu@intel.com
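Conceptually the fix is a one-line move in kernel/fork.c (sketch; mm_pasid_drop() is assumed to be the free helper introduced by the Fixes: commit):

	/* Sketch of the move described above, in diff style: */
	static inline void __mmput(struct mm_struct *mm)
	{
		...
	-	mm_pasid_drop(mm);	/* too early: a device may still use it */
		...
	}

	void __mmdrop(struct mm_struct *mm)
	{
		...
	+	mm_pasid_drop(mm);	/* now outlives the IOMMU's mmgrab() */
		...
	}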
2022-05-01  smp: Make softirq handling RT safe in flush_smp_call_function_queue()  (Sebastian Andrzej Siewior)
flush_smp_call_function_queue() invokes do_softirq() which is not available on PREEMPT_RT. flush_smp_call_function_queue() is invoked from the idle task and the migration task with preemption or interrupts disabled. So RT kernels cannot process soft interrupts in that context, as that has to acquire 'sleeping spinlocks', which is not possible with preemption or interrupts disabled and forbidden from the idle task anyway.

The currently known SMP function call which raises a soft interrupt is in the block layer, but this functionality is not enabled on RT kernels due to latency and performance reasons.

RT could wake up ksoftirqd unconditionally, but this wants to be avoided if there were soft interrupts pending already when this is invoked in the context of the migration task. The migration task might have preempted a threaded interrupt handler which raised a soft interrupt, but did not reach the local_bh_enable() to process it. The "running" ksoftirqd might prevent the handling in the interrupt thread context, which causes latency issues.

Add a new function which handles this case explicitly for RT and falls back to do_softirq() on !RT kernels. In the RT case this warns when one of the flushed SMP function calls raised a soft interrupt, so this can be investigated.

[ tglx: Moved the RT part out of SMP code ]

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/YgKgL6aPj8aBES6G@linutronix.de
Link: https://lore.kernel.org/r/20220413133024.356509586@linutronix.de
2022-05-01  smp: Rename flush_smp_call_function_from_idle()  (Thomas Gleixner)
This is invoked from the stopper thread too, which is definitely not idle. Rename it to flush_smp_call_function_queue() and fixup the callers.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220413133024.305001096@linutronix.de
2022-05-01  sched: Fix missing prototype warnings  (Thomas Gleixner)
A W=1 build emits more than a dozen missing prototype warnings related to the scheduler and scheduler specific includes.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220413133024.249118058@linutronix.de
2022-04-30  task_work: allow TWA_SIGNAL without a rescheduling IPI  (Jens Axboe)
Some use cases don't always need an IPI when sending a TWA_SIGNAL notification. Add TWA_SIGNAL_NO_IPI, which is just like TWA_SIGNAL, except it doesn't send an IPI to the target task. It merely sets TIF_NOTIFY_SIGNAL and wakes up the task.

This can be useful in avoiding a forceful transition to the kernel if the task is running in userspace. Depending on the task_work in question, it may be quite fine waiting for the next reschedule or kernel enter anyway, or the use case may even have other mechanisms for hinting to the task that a transition may be useful. This can drive more cooperative scheduling of task_work.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/821f42b6-7d91-8074-8212-d34998097de4@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
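Usage is the same as with TWA_SIGNAL, just with the new notify mode (illustrative sketch; the work field name is made up):

	/* Queue task_work without forcing an IPI; the task will notice
	 * TIF_NOTIFY_SIGNAL on its next reschedule or kernel entry. */
	ret = task_work_add(task, &req->work_cb, TWA_SIGNAL_NO_IPI);
	if (ret)
		goto fail;	/* task is already exiting */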
2022-04-29  kernel: make taskstats available from all net namespaces  (xu xin)
If getdelays runs in a non-init network namespace, it will fail to get delayacct stats even if it has root privileges, which is not very reasonable. We can simply reproduce this by executing:

	unshare -n getdelays -d -p <pid>

I don't think a net namespace should be an obstacle to the normal execution of the getdelays function. So let's make it available from all net namespaces.

Link: https://lkml.kernel.org/r/20220412071946.2532318-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: "Dr. Thomas Orgis" <thomas.orgis@uni-hamburg.de>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Ismael Luceno <ismael@iodev.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29  taskstats: version 12 with thread group and exe info  (Dr. Thomas Orgis)
The task exit struct needs some crucial information to be able to provide an enhanced version of process and thread accounting. This change provides:

1. ac_tgid in addition to ac_pid
2. thread group execution walltime in ac_tgetime
3. flag AGROUP in ac_flag to indicate the last task in a thread group/process
4. device ID and inode of the task's /proc/self/exe in ac_exe_dev and ac_exe_inode
5. tools/accounting/procacct as demonstrator

When a task exits, taskstats are reported to userspace including the task's pid and ppid, but without the id of the thread group this task is part of. Without the tgid, the stats of single tasks cannot be correlated to each other as a thread group (process).

The taskstats documentation suggests that on process exit a data set consisting of accumulated stats for the whole group is produced. But such an additional set of stats is only produced for actually multithreaded processes, not groups that had only one thread, and those stats only contain data about delay accounting, not the more basic information about CPU and memory resource usage. Adding the AGROUP flag, set when the last task of a group exits, enables determination of process end also for single-threaded processes.

My application basically does enhanced process accounting with summed cputime, biggest maxrss, and tasks per process. The data is not available with the traditional BSD process accounting (which is not designed to be extensible), and the taskstats interface allows more efficient on-the-fly grouping and summing of the stats anyway, without intermediate disk writes.

Furthermore, I carry statistics on which exact program binary is used how often, with associated resources, getting a picture of how important which parts of a collection of installed scientific software in different versions are, and how well they put load on the machine. This is enabled by providing information on /proc/self/exe for each task. I assume the two 64-bit fields for device ID and inode are more appropriate than the possibly large resolved path to keep the data volume down.

Add the tgid to the stats to complete task identification, the flag AGROUP to mark the last task of a group, the group wallclock time, and inode-based identification of the associated executable file. Add tools/accounting/procacct.c as a simplified fork of getdelays.c to demonstrate process and thread accounting.

[thomas.orgis@uni-hamburg.de: fix version number in comment]
Link: https://lkml.kernel.org/r/20220405003601.7a5f6008@plasteblaster
Link: https://lkml.kernel.org/r/20220331004106.64e5616b@plasteblaster
Signed-off-by: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de>
Reviewed-by: Ismael Luceno <ismael@iodev.co.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29  kexec: remove redundant assignments  (Michal Orzel)
Get rid of redundant assignments whose values are never read, either because they are overwritten or because the function ends.

Reported by clang-tidy [deadcode.DeadStores]

Link: https://lkml.kernel.org/r/20220326180948.192154-1-michalorzel.eng@gmail.com
Signed-off-by: Michal Orzel <michalorzel.eng@gmail.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Michal Orzel <michalorzel.eng@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29  ptrace: remove redundant check of #ifdef PTRACE_SINGLESTEP  (Tiezhu Yang)
Patch series "ptrace: do some cleanup".

This patch (of 3):

PTRACE_SINGLESTEP is always defined as 9 in include/uapi/linux/ptrace.h, so remove the redundant #ifdef PTRACE_SINGLESTEP check.

Link: https://lkml.kernel.org/r/1649240981-11024-2-git-send-email-yangtiezhu@loongson.cn
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29  lib/Kconfig.debug: remove more CONFIG_..._VALUE indirections  (Rasmus Villemoes)
As in "kernel/panic.c: remove CONFIG_PANIC_ON_OOPS_VALUE indirection", use the IS_ENABLED() helper rather than having a hidden config option.

Link: https://lkml.kernel.org/r/20220321121301.1389693-1-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
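The pattern being replaced, following the panic.c example the changelog cites (sketch; CONFIG_SOME_KNOB is a stand-in for the real options):

	/* before: Kconfig computes a hidden CONFIG_..._VALUE symbol */
	static int some_knob = CONFIG_SOME_KNOB_VALUE;

	/* after: derive 0/1 directly from the bool option */
	static int some_knob = IS_ENABLED(CONFIG_SOME_KNOB);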
2022-04-29  kernel: pid_namespace: use NULL instead of using plain integer as pointer  (Haowen Bai)
This fixes the following sparse warning:

	kernel/pid_namespace.c:55:77: warning: Using plain integer as NULL pointer

Link: https://lkml.kernel.org/r/1647944288-2806-1-git-send-email-baihaowen@meizu.com
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-29  seccomp: Use FIFO semantics to order notifications  (Sargun Dhillon)
Previously, the seccomp notifier used LIFO semantics, where each notification would be added on top of the stack, and notifications were popped off the top of the stack. This could result in one process that generates a large number of notifications preventing other notifications from being handled. This patch moves from LIFO (stack) semantics to FIFO (queue) semantics.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220428015447.13661-1-sargun@sargun.me
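Mechanically this is the usual list_add() to list_add_tail() swap (sketch; the list and field names are illustrative):

	/* LIFO: push new notifications onto the front of the list */
	list_add(&knotif.list, &notifications);

	/* FIFO: append to the tail, so the oldest notification is
	 * handled first */
	list_add_tail(&knotif.list, &notifications);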
2022-04-29  ftrace: cleanup ftrace_graph_caller enable and disable  (Chengming Zhou)
ftrace_enable_ftrace_graph_caller() and ftrace_disable_ftrace_graph_caller() are used to do special hooks for the graph tracer, which are not needed on some ARCHs that use the graph_ops->func function to install the return_hooker. So introduce weak versions in the ftrace core code to allow cleaning this up in x86.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20220420160006.17880-1-zhouchengming@bytedance.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
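The weak fallbacks are presumably just no-op stubs that architectures with real graph-caller patching override (sketch):

	/* kernel/trace/ftrace.c -- sketch of the weak defaults */
	int __weak ftrace_enable_ftrace_graph_caller(void)
	{
		return 0;
	}

	int __weak ftrace_disable_ftrace_graph_caller(void)
	{
		return 0;
	}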
2022-04-29  sched/fair: Remove cfs_rq_tg_path()  (Dietmar Eggemann)
cfs_rq_tg_path() is used by a tracepoint-to-traceevent (tp-2-te) converter to format the path of a taskgroup or autogroup respectively. It doesn't have any in-kernel users after the removal of the sched_trace_cfs_rq_path() helper function.

cfs_rq_tg_path() can be coded in a tp-2-te converter instead. Remove it from kernel/sched/fair.c.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220428144338.479094-3-qais.yousef@arm.com
2022-04-29  sched/fair: Remove sched_trace_*() helper functions  (Dietmar Eggemann)
We no longer need them, as we can use DWARF debug info or BTF + pahole to re-generate the required structs to compile against for a given kernel. This moves the burden of maintaining these helper functions to the module:

	https://github.com/qais-yousef/sched_tp

Note that at least pahole v1.15 is required for using DWARF, and for BTF at least v1.23, which is not yet released, will be required. There's an alignment problem in earlier versions that will lead to crashes when used with BTF.

We should have enough infrastructure to make these helper functions obsolete now, so remove them.

[Rewrote commit message to reflect the new alternative]

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220428144338.479094-2-qais.yousef@arm.com
2022-04-29  sched/fair: Refactor cpu_util_without()  (Dietmar Eggemann)
Except for the 'task has no contribution or is new' condition at the beginning of cpu_util_without(), which it shares with its load and runnable counterpart functions, a cpu_util_next(..., dst_cpu = -1) call can replace the rest of it.

The UTIL_EST-specific check that the task's util_est has to be subtracted from the CPU's in case of an enqueued (or current, to cater for the wakeup - lb race) task has to be moved to cpu_util_next(). This was initially introduced by commit c469933e7721 ("sched/fair: Fix cpu_util_wake() for 'execl' type workloads").

UnixBench's `execl` throughput tests were run on a dual socket 40-CPU Intel E5-2690 v2 to make sure it doesn't regress again.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220318163656.954440-1-dietmar.eggemann@arm.com
2022-04-29  timekeeping: Mark NMI safe time accessors as notrace  (Kurt Kanzenbach)
Mark the CLOCK_MONOTONIC fast time accessors as notrace. These functions are used in tracing to retrieve timestamps, so they should not recurse.

Fixes: 4498e7467e9e ("time: Parametrize all tk_fast_mono users")
Fixes: f09cb9a1808e ("time: Introduce tk_fast_raw")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220426175338.3807ca4f@gandalf.local.home/
Link: https://lore.kernel.org/r/20220428062432.61063-1-kurt@linutronix.de
2022-04-28  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
include/linux/netdevice.h
net/core/dev.c
	6510ea973d8d ("net: Use this_cpu_inc() to increment net->core_stats")
	794c24e9921f ("net-core: rx_otherhost_dropped to core_stats")
	https://lore.kernel.org/all/20220428111903.5f4304e0@canb.auug.org.au/

drivers/net/wan/cosa.c
	d48fea8401cf ("net: cosa: fix error check return value of register_chrdev()")
	89fbca3307d4 ("net: wan: remove support for COSA and SRP synchronous serial boards")
	https://lore.kernel.org/all/20220428112130.1f689e5e@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-28  Merge tag 'net-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, bpf and netfilter.

Current release - new code bugs:
 - bridge: switchdev: check br_vlan_group() return value
 - use this_cpu_inc() to increment net->core_stats, fix preempt-rt

Previous releases - regressions:
 - eth: stmmac: fix write to sgmii_adapter_base

Previous releases - always broken:
 - netfilter: nf_conntrack_tcp: re-init for syn packets only, resolving issues with TCP fastopen
 - tcp: md5: fix incorrect tcp_header_len for incoming connections
 - tcp: fix F-RTO may not work correctly when receiving DSACK
 - tcp: ensure use of most recently sent skb when filling rate samples
 - tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
 - virtio_net: fix wrong buf address calculation when using xdp
 - xsk: fix forwarding when combining copy mode with busy poll
 - xsk: fix possible crash when multiple sockets are created
 - bpf: lwt: fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
 - sctp: null-check asoc strreset_chunk in sctp_generate_reconf_event
 - wireguard: device: check for metadata_dst with skb_valid_dst()
 - netfilter: update ip6_route_me_harder to consider L3 domain
 - gre: make o_seqno start from 0 in native mode
 - gre: switch o_seqno to atomic to prevent races in collect_md mode

Misc:
 - add Eric Dumazet to networking maintainers
 - dt: dsa: realtek: remove realtek,rtl8367s string
 - netfilter: flowtable: Remove the empty file"

* tag 'net-5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (65 commits)
  tcp: fix F-RTO may not work correctly when receiving DSACK
  Revert "ibmvnic: Add ethtool private flag for driver-defined queue limits"
  net: enetc: allow tc-etf offload even with NETIF_F_CSUM_MASK
  ixgbe: ensure IPsec VF<->PF compatibility
  MAINTAINERS: Update BNXT entry with firmware files
  netfilter: nft_socket: only do sk lookups when indev is available
  net: fec: add missing of_node_put() in fec_enet_init_stop_mode()
  bnx2x: fix napi API usage sequence
  tls: Skip tls_append_frag on zero copy size
  Add Eric Dumazet to networking maintainers
  netfilter: conntrack: fix udp offload timeout sysctl
  netfilter: nf_conntrack_tcp: re-init for syn packets only
  net: dsa: lantiq_gswip: Don't set GSWIP_MII_CFG_RMII_CLK
  net: Use this_cpu_inc() to increment net->core_stats
  Bluetooth: hci_sync: Cleanup hci_conn if it cannot be aborted
  Bluetooth: hci_event: Fix creating hci_conn object on error status
  Bluetooth: hci_event: Fix checking for invalid handle on error status
  ice: fix use-after-free when deinitializing mailbox snapshot
  ice: wait 5 s for EMP reset after firmware flash
  ice: Protect vf_state check by cfg_lock in ice_vc_process_vf_msg()
  ...
2022-04-27  Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-04-27

We've added 85 non-merge commits during the last 18 day(s) which contain a total of 163 files changed, 4499 insertions(+), 1521 deletions(-).

The main changes are:

1) Teach libbpf to enhance BPF verifier log with human-readable and relevant information about failed CO-RE relocations, from Andrii Nakryiko.

2) Add typed pointer support in BPF maps and enable it for unreferenced pointers (via probe read) and referenced ones that can be passed to in-kernel helpers, from Kumar Kartikeya Dwivedi.

3) Improve xsk to break NAPI loop when rx queue gets full to allow for forward progress to consume descriptors, from Maciej Fijalkowski & Björn Töpel.

4) Fix a small RCU read-side race in BPF_PROG_RUN routines which dereferenced the effective prog array before the rcu_read_lock, from Stanislav Fomichev.

5) Implement BPF atomic operations for RV64 JIT, and add libbpf parsing logic for USDT arguments under riscv{32,64}, from Pu Lehui.

6) Implement libbpf parsing of USDT arguments under aarch64, from Alan Maguire.

7) Enable bpftool build for musl and remove nftw with FTW_ACTIONRETVAL usage so it can be shipped under Alpine which is musl-based, from Dominique Martinet.

8) Clean up {sk,task,inode} local storage trace RCU handling as they do not need to use call_rcu_tasks_trace() barrier, from KP Singh.

9) Improve libbpf API documentation and fix error return handling of various API functions, from Grant Seltzer.

10) Enlarge offset check for bpf_skb_{load,store}_bytes() helpers given data length of frags + frag_list may surpass old offset limit, from Liu Jian.

11) Various improvements to prog_tests in area of logging, test execution and by-name subtest selection, from Mykola Lysenko.

12) Simplify map_btf_id generation for all map types by moving this process to build time with help of resolve_btfids infra, from Menglong Dong.

13) Fix a libbpf bug in probing when falling back to legacy bpf_probe_read*() helpers; the probing caused always to use old helpers, from Runqing Yang.

14) Add support for ARCompact and ARCv2 platforms for libbpf's PT_REGS tracing macros, from Vladimir Isaev.

15) Cleanup BPF selftests to remove old & unneeded rlimit code given kernel switched to memcg-based memory accounting a while ago, from Yafang Shao.

16) Refactor of BPF sysctl handlers to move them to BPF core, from Yan Zhu.

17) Fix BPF selftests in two occasions to work around regressions caused by latest LLVM to unblock CI until their fixes are worked out, from Yonghong Song.

18) Misc cleanups all over the place, from various others.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Add libbpf's log fixup logic selftests
  libbpf: Fix up verifier log for unguarded failed CO-RE relos
  libbpf: Simplify bpf_core_parse_spec() signature
  libbpf: Refactor CO-RE relo human description formatting routine
  libbpf: Record subprog-resolved CO-RE relocations unconditionally
  selftests/bpf: Add CO-RE relos and SEC("?...") to linked_funcs selftests
  libbpf: Avoid joining .BTF.ext data with BPF programs by section name
  libbpf: Fix logic for finding matching program for CO-RE relocation
  libbpf: Drop unhelpful "program too large" guess
  libbpf: Fix anonymous type check in CO-RE logic
  bpf: Compute map_btf_id during build time
  selftests/bpf: Add test for strict BTF type check
  selftests/bpf: Add verifier tests for kptr
  selftests/bpf: Add C tests for kptr
  libbpf: Add kptr type tag macros to bpf_helpers.h
  bpf: Make BTF type match stricter for release arguments
  bpf: Teach verifier about kptr_get kfunc helpers
  bpf: Wire up freeing of referenced kptr
  bpf: Populate pairs of btf_id and destructor kfunc in btf
  bpf: Adapt copy_map_value for multiple offset case
  ...
====================

Link: https://lore.kernel.org/r/20220427224758.20976-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-27  Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Jakub Kicinski)
Daniel Borkmann says:

====================
pull-request: bpf 2022-04-27

We've added 5 non-merge commits during the last 20 day(s) which contain a total of 6 files changed, 34 insertions(+), 12 deletions(-).

The main changes are:

1) Fix xsk sockets when rx and tx are separately bound to the same umem, also fix xsk copy mode combined with busy poll, from Maciej Fijalkowski.

2) Fix BPF tunnel/collect_md helpers with bpf_xmit lwt hook usage which triggered a crash due to invalid metadata_dst access, from Eyal Birger.

3) Fix release of page pool in XDP live packet mode, from Toke Høiland-Jørgensen.

4) Fix potential NULL pointer dereference in kretprobes, from Adam Zabrocki. (Masami & Steven preferred this small fix to be routed via bpf tree given it's follow-up fix to Masami's rethook work that went via bpf earlier, too.)

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  xsk: Fix possible crash when multiple sockets are created
  kprobes: Fix KRETPROBES when CONFIG_KRETPROBE_ON_RETHOOK is set
  bpf, lwt: Fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
  bpf: Fix release of page_pool in BPF_PROG_RUN in test runner
  xsk: Fix l2fwd for copy mode + busy poll combo
====================

Link: https://lore.kernel.org/r/20220427212748.9576-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-27  tracing: Remove check of list iterator against head past the loop body  (Jakob Koschel)
When list_for_each_entry() completes the iteration over the whole list without breaking the loop, the iterator value will be a bogus pointer computed based on the head element. While it is safe to use the pointer to determine if it was computed based on the head element, either with list_entry_is_head() or &pos->member == head, using the iterator variable after the loop should be avoided.

In preparation to limit the scope of a list iterator to the list traversal loop, use a dedicated pointer to point to the found element [1].

Link: https://lkml.kernel.org/r/20220427170734.819891-5-jakobkoschel@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
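The resulting pattern looks like this (sketch with illustrative names, not the exact tracing code):

	struct foo *iter, *found = NULL;

	list_for_each_entry(iter, &foo_list, list) {
		if (iter->id == id) {
			found = iter;
			break;
		}
	}
	/* After the loop 'iter' may be a bogus head-derived pointer;
	 * only 'found' is safe to dereference. */
	if (!found)
		return -ENOENT;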
2022-04-27  tracing: Replace usage of found with dedicated list iterator variable  (Jakob Koschel)
To move the list iterator variable into the list_for_each_entry_*() macro in the future, it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop, it was concluded to use a separate iterator variable instead of a found boolean [1]. This removes the need for a found variable; simply checking whether the variable was set determines if the break/goto was hit.

Link: https://lkml.kernel.org/r/20220427170734.819891-4-jakobkoschel@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-27  tracing: Remove usage of list iterator variable after the loop  (Jakob Koschel)
In preparation to limit the scope of a list iterator to the list traversal loop, use a dedicated pointer to point to the found element [1].

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/20220427170734.819891-3-jakobkoschel@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-27  tracing: Remove usage of list iterator after the loop body  (Jakob Koschel)
In preparation to limit the scope of the list iterator variable to the traversal loop, use a dedicated pointer to point to the found element [1].

Before, when using &pos->list, the code implicitly used the head when no element was found. Since the new variable is only set if an element was found, the head needs to be used explicitly if the variable is NULL.

Link: https://lkml.kernel.org/r/20220427170734.819891-2-jakobkoschel@gmail.com
Cc: Ingo Molnar <mingo@redhat.com>
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ [1]
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-27  tracing: Introduce trace clock tai  (Kurt Kanzenbach)
A fast/NMI safe accessor for CLOCK_TAI has been introduced. Use it for adding the additional trace clock "tai".

Link: https://lkml.kernel.org/r/20220414091805.89667-3-kurt@linutronix.de
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
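Once registered, the new clock can presumably be selected like any other trace clock (path assumes the usual tracefs mount point):

	echo tai > /sys/kernel/debug/tracing/trace_clock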
2022-04-27  ring-buffer: Have 32 bit time stamps use all 64 bits  (Steven Rostedt (Google))
When the new logic was made to handle deltas of events from interrupts that interrupted other events, it required 64 bit local atomics. Unfortunately, 64 bit local atomics are expensive on 32 bit architectures. Thus, commit 10464b4aa605e ("ring-buffer: Add rb_time_t 64 bit operations for speeding up 32 bit") created a type of seq lock timer for 32 bits. It used two 32 bit local atomics, but required 2 bits from each of them for synchronization, making it only 60 bits.

Add a new "msb" field to hold the extra 4 bits that are cut off.

Link: https://lore.kernel.org/all/20220426175338.3807ca4f@gandalf.local.home/
Link: https://lkml.kernel.org/r/20220427170812.53cc7139@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-27  ring-buffer: Have absolute time stamps handle large numbers  (Steven Rostedt (Google))
There's an absolute timestamp event in the ring buffer, but it only saves 59 bits of the timestamp, as the 5 MSBs are used for meta data (stating it is an absolute time stamp). This was never an issue, as all the clocks currently in use never set those 5 MSBs. But now there's a new clock (TAI) that does.

To handle this case, when reading an absolute timestamp, a previous full timestamp is passed in, and the 5 MSBs of that timestamp are OR'd into the absolute timestamp (if any of the 5 MSBs are set). Then, to test for overflow, if the new result is smaller than the passed-in previous timestamp, 1 << 59 is added to it.

All the extra processing is done on the reader "slow" path, with the exception of the "too big delta" check, and the reading of timestamps for histograms.

Note, libtraceevent will need to be updated to handle this case as well. But this is not a user space regression, as user space was never able to handle any timestamps that used more than 59 bits.

Link: https://lore.kernel.org/all/20220426175338.3807ca4f@gandalf.local.home/
Link: https://lkml.kernel.org/r/20220427153339.16c33f75@gandalf.local.home
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
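A sketch of the reader-side fixup described above (the helper and constant names are illustrative; the 5 MSBs of a 64-bit value are bits 59-63):

	#define TS_MSB	(0xf8ULL << 56)	/* top 5 bits of a 64-bit ts */

	/* Sketch: restore the 5 MSBs of an absolute timestamp from the
	 * last full timestamp and handle wrap of the low 59 bits. */
	static u64 fix_abs_ts(u64 abs, u64 save_ts)
	{
		if (save_ts & TS_MSB) {
			abs |= save_ts & TS_MSB;
			/* check for overflow of the 59-bit part */
			if (abs < save_ts)
				abs += 1ULL << 59;
		}
		return abs;
	}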
2022-04-27  x86/split_lock: Make life miserable for split lockers  (Tony Luck)
In https://lore.kernel.org/all/87y22uujkm.ffs@tglx/ Thomas said:

	It's simply wishful thinking that stuff gets fixed because of a
	WARN_ONCE(). This has never worked. The only thing which works is to
	make stuff fail hard or slow it down in a way which makes it annoying
	enough to users to complain.

He was talking about WBINVD. But it made me think about how we use the split lock detection feature in Linux.

Existing code has three options for applications:

1) Don't enable split lock detection (allow arbitrary split locks)
2) Warn once when a process uses a split lock, but let the process keep running with split lock detection disabled
3) Kill processes that use split locks

Option 2 falls into the "wishful thinking" territory that Thomas warns does nothing. But option 3 might not be viable in a situation with legacy applications that need to run. Hence make option 2 much stricter to "slow it down in a way which makes it annoying".

The primary reason for this change is to provide better quality of service to the rest of the applications running on the system. Internal testing shows that even with many processes splitting locks, performance for the rest of the system is much more responsive.

The new "warn" mode operates like this. When an application tries to execute a bus lock, the #AC handler:

1) Delays (interruptibly) 10 ms before moving to the next step.
2) Blocks (interruptibly) until it can get the semaphore. If interrupted, just return. Assume the signal will either kill the task, or direct execution away from the instruction that is trying to get the bus lock.
3) Disables split lock detection for the current core.
4) Schedules a work queue to re-enable split lock detection in 2 jiffies.
5) Returns.

The work queue that re-enables split lock detection also releases the semaphore.

There is a corner case where a CPU may be taken offline while split lock detection is disabled. A CPU hotplug handler handles this case.

The old behaviour was to only print the split lock warning on the first occurrence of a split lock from a task. Preserve that by adding a flag to the task structure that suppresses subsequent split lock messages from that task.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220310204854.31752-2-tony.luck@intel.com
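Put together, the #AC path reads roughly like this (heavily simplified sketch; sld_update_msr(), buslock_sem and sl_reenable are illustrative names, not the exact patch):

	static void split_lock_warn(unsigned long ip)
	{
		/* 1) make the offender pay: interruptible 10ms delay */
		if (msleep_interruptible(10) > 0)
			return;

		/* 2) serialize split-lockers system wide */
		if (down_interruptible(&buslock_sem) == -EINTR)
			return;

		/* 3) let this one instruction through */
		sld_update_msr(false);	/* disable detection on this core */

		/* 4) re-enable detection (and release the semaphore) from
		 *    delayed work ~2 jiffies later, then 5) return */
		schedule_delayed_work_on(smp_processor_id(), &sl_reenable, 2);
	}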
2022-04-26  tracing: make tracer_init_tracefs initcall asynchronous  (Mark-PK Tsai)
Move trace_eval_init() to subsys_initcall to make it start earlier. And to avoid tracer_init_tracefs() being blocked by the trace_event_sem which trace_eval_init() holds [1], queue tracer_init_tracefs() to eval_map_wq to let the two works be executed sequentially.

This speeds up kernel initialization by making tracer_init_tracefs() asynchronous. On my arm64 platform, it reduces do_initcalls() time by ~20ms out of a total of 125ms.

Link: https://lkml.kernel.org/r/20220426122407.17042-3-mark-pk.tsai@mediatek.com
[1]: https://lore.kernel.org/r/68d7b3327052757d0cd6359a6c9015a85b437232.camel@pengutronix.de
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  tracing: Avoid adding tracer option before update_tracer_options  (Mark-PK Tsai)
To prepare for supporting an asynchronous tracer_init_tracefs() initcall, avoid calling create_trace_option_files() before __update_tracer_options(). Otherwise, create_trace_option_files() will show a warning because some tracers in the trace_types list are already in tr->topts.

For example, hwlat_tracer calls register_tracer() in late_initcall, and global_trace.dir is already created in tracing_init_dentry(), so hwlat_tracer is put into tr->topts. Then, if __update_tracer_options() is executed after hwlat_tracer is registered, create_trace_option_files() finds that hwlat_tracer is already in tr->topts.

Link: https://lkml.kernel.org/r/20220426122407.17042-2-mark-pk.tsai@mediatek.com
Link: https://lore.kernel.org/lkml/20220322133339.GA32582@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  ring-buffer: Simplify if-if to if-else  (Wan Jiabing)
Use if and else instead of if (A) and if (!A).

Link: https://lkml.kernel.org/r/20220426070628.167565-1-wanjiabing@vivo.com
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  tracing: Use WARN instead of printk and WARN_ON  (Guo Zhengkui)
Use `WARN(cond, ...)` instead of `if (cond)` + `printk(...)` + `WARN_ON(1)`.

Link: https://lkml.kernel.org/r/20220424131932.3606-1-guozhengkui@vivo.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
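The transformation, sketched:

	/* before */
	if (cond) {
		printk(KERN_WARNING "something went wrong\n");
		WARN_ON(1);
	}

	/* after: one macro, same backtrace, message included in the splat */
	WARN(cond, "something went wrong\n");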
2022-04-26  tracing: Fix sleeping function called from invalid context on RT kernel  (Jun Miao)
When setting bootparams="trace_event=initcall:initcall_start tp_printk=1" in the cmdline, output_printk() is called and spin_lock_irqsave() is taken in an atomic, interrupt-disabled context. On a PREEMPT_RT kernel these locks are replaced with sleepable rt-spinlocks, so a call trace is triggered. Fix it by using raw_spin_lock_irqsave() when PREEMPT_RT and "trace_event=initcall:initcall_start tp_printk=1" are enabled.

	BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:46
	in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
	preempt_count: 2, expected: 0
	RCU nest depth: 0, expected: 0
	Preemption disabled at:
	[<ffffffff8992303e>] try_to_wake_up+0x7e/0xba0
	CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.1-rt17+ #19 34c5812404187a875f32bee7977f7367f9679ea7
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
	Call Trace:
	 <TASK>
	 dump_stack_lvl+0x60/0x8c
	 dump_stack+0x10/0x12
	 __might_resched.cold+0x11d/0x155
	 rt_spin_lock+0x40/0x70
	 trace_event_buffer_commit+0x2fa/0x4c0
	 ? map_vsyscall+0x93/0x93
	 trace_event_raw_event_initcall_start+0xbe/0x110
	 ? perf_trace_initcall_finish+0x210/0x210
	 ? probe_sched_wakeup+0x34/0x40
	 ? ttwu_do_wakeup+0xda/0x310
	 ? trace_hardirqs_on+0x35/0x170
	 ? map_vsyscall+0x93/0x93
	 do_one_initcall+0x217/0x3c0
	 ? trace_event_raw_event_initcall_level+0x170/0x170
	 ? push_cpu_stop+0x400/0x400
	 ? cblist_init_generic+0x241/0x290
	 kernel_init_freeable+0x1ac/0x347
	 ? _raw_spin_unlock_irq+0x65/0x80
	 ? rest_init+0xf0/0xf0
	 kernel_init+0x1e/0x150
	 ret_from_fork+0x22/0x30
	 </TASK>

Link: https://lkml.kernel.org/r/20220419013910.894370-1-jun.miao@intel.com
Signed-off-by: Jun Miao <jun.miao@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  tracing: Change `if (strlen(glob))` to `if (glob[0])`  (Ammar Faizi)
No need to traverse to the end of the string. If the first byte is not a NUL char, `if (strlen(glob))` is guaranteed to be true.

Link: https://lkml.kernel.org/r/20220417185630.199062-3-ammarfaizi2@gnuweeb.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: GNU/Weeb Mailing List <gwml@vger.gnuweeb.org>
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
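I.e. (sketch):

	/* before: O(n) walk over the whole pattern */
	if (strlen(glob))
		...

	/* after: O(1), same truth value */
	if (glob[0])
		...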
2022-04-26  tracing: Return -EINVAL if WARN_ON(!glob) triggered in event_hist_trigger_parse()  (Ammar Faizi)
If `WARN_ON(!glob)` is ever triggered, we will still continue executing the next lines, which will then trigger a more serious problem, a NULL pointer dereference bug. Just return -EINVAL if @glob is NULL.

Link: https://lkml.kernel.org/r/20220417185630.199062-2-ammarfaizi2@gnuweeb.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: GNU/Weeb Mailing List <gwml@vger.gnuweeb.org>
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
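The fix amounts to (sketch):

	if (WARN_ON(!glob))
		return -EINVAL;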
2022-04-26  tracing: Make tp_printk work on syscall tracepoints  (Jeff Xie)
Currently the tp_printk option has no effect on syscall tracepoints. When adding the kernel option parameter tp_printk and then running:

	echo 1 > /sys/kernel/debug/tracing/events/syscalls/enable

no trace information is printed on the terminal when running any application. Add printk output for syscall tracepoints as well.

Link: https://lkml.kernel.org/r/20220410145025.681144-1-xiehuan09@gmail.com
Signed-off-by: Jeff Xie <xiehuan09@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  tracing: Fix tracing_map_sort_entries() kernel-doc comment  (Yang Li)
Add the description of @n_sort_keys and change @sort_key to @sort_keys in the tracing_map_sort_entries() kernel-doc comment to remove the warnings found by running scripts/kernel-doc with 'make W=1':

	kernel/trace/tracing_map.c:1073: warning: Function parameter or member 'sort_keys' not described in 'tracing_map_sort_entries'
	kernel/trace/tracing_map.c:1073: warning: Function parameter or member 'n_sort_keys' not described in 'tracing_map_sort_entries'
	kernel/trace/tracing_map.c:1073: warning: Excess function parameter 'sort_key' description in 'tracing_map_sort_entries'

Link: https://lkml.kernel.org/r/20220402072015.45864-1-yang.lee@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  tracing: Fix kernel-doc  (Jiapeng Chong)
Fix the following W=1 kernel warning:

	kernel/trace/trace.c:1181: warning: expecting prototype for tracing_snapshot_cond_data(). Prototype was for tracing_cond_snapshot_data() instead.

Link: https://lkml.kernel.org/r/20220218100849.122038-1-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  tracing: Fix inconsistent style of mini-HOWTO  (Oscar Shiang)
Each description should start with a hyphen and a space. Insert spaces to fix it.

Link: https://lkml.kernel.org/r/TYCP286MB19130AA4A9C6FC5A8793DED2A1359@TYCP286MB1913.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Oscar Shiang <oscar0225@livemail.tw>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  tracing: Separate hist state updates from hist registration  (Tom Zanussi)
hist_register_trigger() handles both new hist registration as well as existing hist registration through event_command.reg(). Adding a new function, existing_hist_update_only(), that checks and updates existing histograms and exits after doing so allows the confusing logic in event_hist_trigger_parse() to be simplified.

Link: https://lkml.kernel.org/r/211b2cd3e3d7e00f4f8ad45ef8b33063da6a7e05.1644010576.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-04-26  tracing: Have existing event_command.parse() implementations use helpers  (Tom Zanussi)
Simplify the existing event_command.parse() implementations by having them make use of the helper functions previously introduced.

Link: https://lkml.kernel.org/r/b353e3427a81f9d3adafd98fd7d73e78a8209f43.1644010576.git.zanussi@kernel.org
Signed-off-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>