path: root/kernel
Age  Commit message  Author
2022-10-21  bpf: Use __llist_del_all() whenever possbile during memory draining  Hou Tao
Except for waiting_for_gp list, there are no concurrent operations on free_by_rcu, free_llist and free_llist_extra lists, so use __llist_del_all() instead of llist_del_all(). waiting_for_gp list can be deleted by RCU callback concurrently, so still use llist_del_all(). Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221021114913.60508-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21  bpf: Wait for busy refill_work when destroying bpf memory allocator  Hou Tao
A busy irq work is an unfinished irq work and it can be either in the pending state or in the running state. When destroying the bpf memory allocator, refill_work may be busy on a PREEMPT_RT kernel, in which irq work is invoked in a per-CPU RT-kthread. It is also possible on a kernel with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host or mips), where irq work is invoked in the timer interrupt.

The busy refill_work leads to various issues. The obvious one is that there will be concurrent operations on free_by_rcu and free_list between irq work and memory draining. Another one is that call_rcu_in_progress will not be reliable for the checking of pending RCU callbacks, because do_call_rcu() may not have been invoked by irq work yet. The other is that there will be a use-after-free if the irq work is freed before its callback is invoked, as shown below:

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor instruction fetch in kernel mode
  #PF: error_code(0x0010) - not-present page
  PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
  Oops: 0010 [#1] PREEMPT_RT SMP
  CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:0x0
  Code: Unable to access opcode bytes at 0xffffffffffffffd6.
  RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
  RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
  RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
  ......
  Call Trace:
   <TASK>
   irq_work_single+0x24/0x60
   irq_work_run_list+0x24/0x30
   run_irq_workd+0x23/0x30
   smpboot_thread_fn+0x203/0x300
   kthread+0x126/0x150
   ret_from_fork+0x1f/0x30
   </TASK>

Considering the ease of concurrency handling, no overhead for irq_work_sync() under non-PREEMPT_RT kernel and has-irq-work-interrupt kernel and the short wait time used for irq_work_sync() under PREEMPT_RT (when running two test_maps on PREEMPT_RT kernel and 72-cpus host, the max wait time is about 8ms and the 99th percentile is 10us), just use irq_work_sync() to wait for busy refill_work to complete before memory draining and memory freeing.

Fixes: 7c8199e24fa0 ("bpf: Introduce any context BPF specific memory allocator.")
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221021114913.60508-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21  Merge tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux  Linus Torvalds
Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
     - fix nvme-hwmon for DMA non-coherent architectures (Serge Semin)
     - add a nvme-hwmon maintainer (Christoph Hellwig)
     - fix error pointer dereference in error handling (Dan Carpenter)
     - fix invalid memory reference in nvmet_subsys_attr_qid_max_show (Daniel Wagner)
     - don't limit the DMA segment size in nvme-apple (Russell King)
     - fix workqueue MEM_RECLAIM flushing dependency (Sagi Grimberg)
     - disable write zeroes on various Kingston SSDs (Xander Li)
 - fix a memory leak with block device tracing (Ye)
 - flexible-array fix for ublk (Yushan)
 - document the ublk recovery feature from this merge window (ZiyangZhang)
 - remove dead bfq variable in struct (Yuwei)
 - error handling rq clearing fix (Yu)
 - add an IRQ safety check for the cached bio freeing (Pavel)
 - drbd bio cloning fix (Christoph)

* tag 'block-6.1-2022-10-20' of git://git.kernel.dk/linux:
  blktrace: remove unnessary stop block trace in 'blk_trace_shutdown'
  blktrace: fix possible memleak in '__blk_trace_remove'
  blktrace: introduce 'blk_trace_{start,stop}' helper
  bio: safeguard REQ_ALLOC_CACHE bio put
  block, bfq: remove unused variable for bfq_queue
  drbd: only clone bio if we have a backing device
  ublk_drv: use flexible-array member instead of zero-length array
  nvmet: fix invalid memory reference in nvmet_subsys_attr_qid_max_show
  nvmet: fix workqueue MEM_RECLAIM flushing dependency
  nvme-hwmon: kmalloc the NVME SMART log buffer
  nvme-hwmon: consistently ignore errors from nvme_hwmon_init
  nvme: add Guenther as nvme-hwmon maintainer
  nvme-apple: don't limit DMA segement size
  nvme-pci: disable write zeroes on various Kingston SSD
  nvme: fix error pointer dereference in error handling
  Documentation: document ublk user recovery feature
  blk-mq: fix null pointer dereference in blk_mq_clear_rq_mapping()
2022-10-21  Merge tag 'mm-hotfixes-stable-2022-10-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  Linus Torvalds
Pull misc fixes from Andrew Morton:
 "Seventeen hotfixes, mainly for MM. Five are cc:stable and the remainder address post-6.0 issues"

* tag 'mm-hotfixes-stable-2022-10-20' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  nouveau: fix migrate_to_ram() for faulting page
  mm/huge_memory: do not clobber swp_entry_t during THP split
  hugetlb: fix memory leak associated with vma_lock structure
  mm/page_alloc: reduce potential fragmentation in make_alloc_exact()
  mm: /proc/pid/smaps_rollup: fix maple tree search
  mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages
  mm/mmap: fix MAP_FIXED address return on VMA merge
  mm/mmap.c: __vma_adjust(): suppress uninitialized var warning
  mm/mmap: undo ->mmap() when mas_preallocate() fails
  init: Kconfig: fix spelling mistake "satify" -> "satisfy"
  ocfs2: clear dinode links count in case of error
  ocfs2: fix BUG when iput after ocfs2_mknod fails
  gcov: support GCC 12.1 and newer compilers
  zsmalloc: zs_destroy_pool: add size_class NULL check
  mm/mempolicy: fix mbind_range() arguments to vma_merge()
  mailmap: update email for Qais Yousef
  mailmap: update Dan Carpenter's email address
2022-10-21  srcu: Debug NMI safety even on archs that don't require it  Frederic Weisbecker
Currently the NMI safety debugging is only performed on architectures that don't support NMI-safe this_cpu_inc(). Reorder the code so that other architectures like x86 also detect bad uses. [ paulmck: Apply kernel test robot, Stephen Rothwell, and Zqiang feedback. ] Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21  srcu: Explain the reason behind the read side critical section on GP start  Frederic Weisbecker
Tell about the need to protect against concurrent updaters who may overflow the GP counter behind the current update. Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21  srcu: Warn when NMI-unsafe API is used in NMI  Frederic Weisbecker
Using the NMI-unsafe reader API from within an NMI handler is very likely to be buggy for three reasons:

1) NMIs aren't strictly re-entrant (a pending nested NMI will execute at the end of the current one) so it should be fine to use a non-atomic increment here. However, breakpoints can still interrupt NMIs and if a breakpoint callback has a reader on that same ssp, a racy increment can happen.

2) If the only reader site for a given srcu_struct structure is in an NMI handler, then RCU should be used instead of SRCU.

3) Because of the previous reason (2), an srcu_struct structure having an SRCU read side critical section in an NMI handler is likely to have another one from a task context.

For all these reasons, warn if an NMI-unsafe reader API is used from an NMI handler.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21  rcu: Fix __this_cpu_read() lockdep warning in rcu_force_quiescent_state()  Zqiang
Running rcutorture with non-zero fqs_duration module parameter in a kernel built with CONFIG_PREEMPTION=y results in the following splat:

  BUG: using __this_cpu_read() in preemptible [00000000] code: rcu_torture_fqs/398
  caller is __this_cpu_preempt_check+0x13/0x20
  CPU: 3 PID: 398 Comm: rcu_torture_fqs Not tainted 6.0.0-rc1-yoctodev-standard+
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5b/0x86
   dump_stack+0x10/0x16
   check_preemption_disabled+0xe5/0xf0
   __this_cpu_preempt_check+0x13/0x20
   rcu_force_quiescent_state.part.0+0x1c/0x170
   rcu_force_quiescent_state+0x1e/0x30
   rcu_torture_fqs+0xca/0x160
   ? rcu_torture_boost+0x430/0x430
   kthread+0x192/0x1d0
   ? kthread_complete_and_exit+0x30/0x30
   ret_from_fork+0x22/0x30
   </TASK>

The problem is that rcu_force_quiescent_state() uses __this_cpu_read() in preemptible code instead of the proper raw_cpu_read(). This commit therefore changes __this_cpu_read() to raw_cpu_read().

Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21  rcu-tasks: Make grace-period-age message human-readable  Paul E. McKenney
This commit adds a few words to the informative message that appears every ten seconds in RCU Tasks and RCU Tasks Trace grace periods. This message currently reads as follows:

  rcu_tasks_wait_gp: rcu_tasks grace period 1046 is 10088 jiffies old.

After this change, it provides additional context, instead reading as follows:

  rcu_tasks_wait_gp: rcu_tasks grace period number 1046 (since boot) is 10088 jiffies old.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-21  rcu: Remove rcu_is_idle_cpu()  Yipeng Zou
The commit 3fcd6a230fa7 ("x86/cpu: Avoid cpuinfo-induced IPIing of idle CPUs") introduced rcu_is_idle_cpu() in order to identify the current CPU idle state. But commit f3eca381bd49 ("x86/aperfmperf: Replace arch_freq_get_on_cpu()") switched to using MAX_SAMPLE_AGE, so rcu_is_idle_cpu() is no longer used. This commit therefore removes it. Fixes: f3eca381bd49 ("x86/aperfmperf: Replace arch_freq_get_on_cpu()") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-20  gcov: support GCC 12.1 and newer compilers  Martin Liska
Starting with GCC 12.1, the created .gcda format can't be read by gcov tool. There are 2 significant changes to the .gcda file format that need to be supported: a) [gcov: Use system IO buffering] (23eb66d1d46a34cb28c4acbdf8a1deb80a7c5a05) changed that all sizes in the format are in bytes and not in words (4B) b) [gcov: make profile merging smarter] (72e0c742bd01f8e7e6dcca64042b9ad7e75979de) add a new checksum to the file header. Tested with GCC 7.5, 10.4, 12.2 and the current master. Link: https://lkml.kernel.org/r/624bda92-f307-30e9-9aaa-8cc678b2dfb2@suse.cz Signed-off-by: Martin Liska <mliska@suse.cz> Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com> Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
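The two format changes named in this commit message can be pictured with a small sketch. It is only an illustration under stated assumptions: the helper names (`model_unit_size`, `model_length_field`, `model_header_words`) are hypothetical and are not the kernel's gcov code, and the pre-GCC-12 header is assumed to consist of three 32-bit words (magic, version, stamp).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: how many length units one 4-byte word counts as.
 * Before GCC 12 lengths are counted in words, from GCC 12 on in bytes. */
static uint32_t model_unit_size(bool gcc12_or_newer)
{
    return gcc12_or_newer ? 4u : 1u;
}

/* Hypothetical helper: value stored in a record's length field for a
 * payload of `words` 32-bit words. */
static uint32_t model_length_field(uint32_t words, bool gcc12_or_newer)
{
    return words * model_unit_size(gcc12_or_newer);
}

/* Hypothetical helper: number of 32-bit words in the file header.
 * Assumed layout: magic, version, stamp, plus a checksum from GCC 12 on. */
static uint32_t model_header_words(bool gcc12_or_newer)
{
    return gcc12_or_newer ? 4u : 3u;
}
```

A reader that still interprets the length field as a word count would overshoot every record written by a newer compiler by a factor of four, which is one way to see why the older gcov tool cannot parse the newer files.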
2022-10-20  bpf: Fix dispatcher patchable function entry to 5 bytes nop  Jiri Olsa
The patchable_function_entry(5) attribute might emit 5 single-byte nop instructions (depending on the toolchain), which clashes with the bpf_arch_text_poke check for a 5-byte nop instruction. Add an early init call for the dispatcher that checks the patchable entry and, if needed, changes it into the expected 5-byte nop instruction. There's no need to take text_mutex, because this happens in an early init call, which runs at pre-smp time. Fixes: ceea991a019c ("bpf: Move bpf_dispatcher function out of ftrace locations") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221018075934.574415-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-20  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  Jakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-20  rcu: Keep synchronize_rcu() from enabling irqs in early boot  Paul E. McKenney
Making polled RCU grace periods account for expedited grace periods required acquiring the leaf rcu_node structure's lock during early boot, but after rcu_init() was called. This lock is irq-disabled, but the code incorrectly assumes that irqs are always enabled when invoking synchronize_rcu(). The exception is early boot before the scheduler has started, when irqs are still disabled, which means that upon return from synchronize_rcu(), irqs will be incorrectly enabled. This commit fixes this bug by using irqsave/irqrestore locking primitives. Fixes: bf95b2bc3e42 ("rcu: Switch polled grace-period APIs to ->gp_seq_polled") Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
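The difference between the two locking styles can be shown with a small userspace model. This is a sketch only: a single boolean stands in for the CPU's interrupt-enable state, and the `model_*` functions are hypothetical stand-ins, not the kernel's locking primitives.

```c
#include <stdbool.h>

/* Userspace model: one flag stands in for the CPU's interrupt-enable state. */
static bool irqs_enabled;

/* Model of a lock/unlock pair that does not remember the caller's state. */
static void model_lock_irq(void)
{
    irqs_enabled = false;
}

static void model_unlock_irq(void)
{
    irqs_enabled = true;    /* unconditionally re-enables */
}

/* Model of the save/restore pair: the caller's state is saved and put back. */
static bool model_lock_irqsave(void)
{
    bool flags = irqs_enabled;

    irqs_enabled = false;
    return flags;
}

static void model_unlock_irqrestore(bool flags)
{
    irqs_enabled = flags;
}
```

If the caller enters with interrupts disabled, as in early boot, the first pair leaves them enabled on return, while the save/restore pair leaves them as it found them.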
2022-10-20  srcu: Check for consistent global per-srcu_struct NMI safety  Paul E. McKenney
This commit adds runtime checks to verify that a given srcu_struct uses consistent NMI-safe (or not) read-side primitives globally, but based on the per-CPU data. These global checks are made by the grace-period code that must scan the srcu_data structures anyway, and are done only in kernels built with CONFIG_PROVE_RCU=y. Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
2022-10-20  srcu: Check for consistent per-CPU per-srcu_struct NMI safety  Paul E. McKenney
This commit adds runtime checks to verify that a given srcu_struct uses consistent NMI-safe (or not) read-side primitives on a per-CPU basis. Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
2022-10-20  srcu: Create an srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe()  Paul E. McKenney
On strict load-store architectures, the use of this_cpu_inc() by srcu_read_lock() and srcu_read_unlock() is not NMI-safe in TREE SRCU. To see this suppose that an NMI arrives in the middle of srcu_read_lock(), just after it has read ->srcu_lock_count, but before it has written the incremented value back to memory. If that NMI handler also does srcu_read_lock() and srcu_read_unlock() on that same srcu_struct structure, then upon return from that NMI handler, the interrupted srcu_read_lock() will overwrite the NMI handler's update to ->srcu_lock_count, but leave unchanged the NMI handler's update by srcu_read_unlock() to ->srcu_unlock_count.

This can result in a too-short SRCU grace period, which can in turn result in arbitrary memory corruption.

If the NMI handler instead interrupts the srcu_read_unlock(), this can result in eternal SRCU grace periods, which is not much better.

This commit therefore creates a pair of new srcu_read_lock_nmisafe() and srcu_read_unlock_nmisafe() functions, which allow SRCU readers in both NMI handlers and in process and IRQ context. It is bad practice to mix the existing and the new _nmisafe() primitives on the same srcu_struct structure. Use one set or the other, not both.

Just to underline that "bad practice" point, using srcu_read_lock() at process level and srcu_read_lock_nmisafe() in your NMI handler will not, repeat NOT, work. If you do not immediately understand why this is the case, please review the earlier paragraphs in this commit log.

[ paulmck: Apply kernel test robot feedback. ]
[ paulmck: Apply feedback from Randy Dunlap. ]
[ paulmck: Apply feedback from John Ogness. ]
[ paulmck: Apply feedback from Frederic Weisbecker. ]

Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
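The lost update described in this commit log can be replayed deterministically in a single thread. The following is a simplified userspace model, not kernel code: the struct and function names are hypothetical, one pair of plain counters stands in for a CPU's ->srcu_lock_count and ->srcu_unlock_count, and the NMI is modelled as an ordinary function call placed between the load and the store.

```c
#include <stdbool.h>

/* Hypothetical model of one CPU's SRCU counters (not the kernel's struct). */
struct srcu_counters_model {
    long lock_count;
    long unlock_count;
};

/* An NMI handler that runs a complete reader: lock, then unlock. */
static void nmi_reader(struct srcu_counters_model *c)
{
    c->lock_count++;
    c->unlock_count++;
}

/*
 * Model of a non-atomic increment on a load-store architecture:
 * load the counter, optionally take an NMI, then store the stale value + 1.
 */
static void nonatomic_read_lock(struct srcu_counters_model *c, bool nmi_in_middle)
{
    long tmp = c->lock_count;       /* load */

    if (nmi_in_middle)
        nmi_reader(c);              /* NMI arrives between load and store */
    c->lock_count = tmp + 1;        /* store overwrites the NMI's increment */
}

/* What a grace-period scan would conclude: locks minus unlocks. */
static long readers_in_flight(const struct srcu_counters_model *c)
{
    return c->lock_count - c->unlock_count;
}
```

With the NMI in the middle, both counters end at 1, so the scan sees zero readers even though the interrupted reader is still inside its critical section; that is the too-short grace period the log warns about. Without the NMI the same call leaves one reader in flight, as expected.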
2022-10-20  blktrace: remove unnessary stop block trace in 'blk_trace_shutdown'  Ye Bin
As of the previous commit, 'blk_trace_cleanup' will stop block trace if block trace's state is 'Blktrace_running'. So remove the unnecessary stop of block trace in 'blk_trace_shutdown'. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019033602.752383-4-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-20  blktrace: fix possible memleak in '__blk_trace_remove'  Ye Bin
When testing as follows:

  step1: ioctl(sda, BLKTRACESETUP, &arg)
  step2: ioctl(sda, BLKTRACESTART, NULL)
  step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
  step4: ioctl(sda, BLKTRACESETUP, &arg)

the following issue is reported:

  debugfs: File 'dropped' in directory 'sda' already present!
  debugfs: File 'msg' in directory 'sda' already present!
  debugfs: File 'trace0' in directory 'sda' already present!

syzkaller also reports an issue like "KASAN: use-after-free Read in relay_switch_subbuf":
"https://syzkaller.appspot.com/bug?id=13849f0d9b1b818b087341691be6cc3ac6a6bfb7"

If block trace is removed without first being stopped (BLKTRACESTOP), '__blk_trace_remove' will just set 'q->blk_trace' to NULL. However, the debugfs files aren't removed, so "file already present" is reported on the next BLKTRACESETUP call.

  static int __blk_trace_remove(struct request_queue *q)
  {
          struct blk_trace *bt;

          bt = rcu_replace_pointer(q->blk_trace, NULL,
                                   lockdep_is_held(&q->debugfs_mutex));
          if (!bt)
                  return -EINVAL;

          if (bt->trace_state != Blktrace_running)
                  blk_trace_cleanup(q, bt);

          return 0;
  }

If the test is instead done as follows:

  step1: ioctl(sda, BLKTRACESETUP, &arg)
  step2: ioctl(sda, BLKTRACESTART, NULL)
  step3: ioctl(sda, BLKTRACETEARDOWN, NULL)
  step4: remove sda

the debugfs directory is removed, which recursively removes all files under it:

  >> blk_release_queue
     >> debugfs_remove_recursive(q->debugfs_dir)

So all files created in 'do_blk_trace_setup' are removed, and 'dentry->d_inode' is NULL. But the blk_trace is still on the list protected by 'running_trace_lock', and 'trace_note_tsk' will traverse all nodes on that list:

  >> trace_note_tsk
     >> trace_note
        >> relay_reserve
           >> relay_switch_subbuf
              >> d_inode(buf->dentry)->i_size

To solve the above issues, following commit '5afedf670caf', call 'blk_trace_cleanup' unconditionally in '__blk_trace_remove' and first stop block trace in 'blk_trace_cleanup'.
Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019033602.752383-3-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
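The before/after behaviour of the removal path can be sketched with a small userspace model. Everything here is hypothetical scaffolding (the `model_*` names, the boolean standing in for the debugfs files); it only mirrors the control flow described in the commit message and is not the kernel's blktrace code.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_state { MODEL_SETUP, MODEL_RUNNING, MODEL_STOPPED };

struct model_trace {
    enum model_state state;
    bool debugfs_files_present;   /* stands in for 'dropped', 'msg', 'trace0' */
};

struct model_queue {
    struct model_trace *bt;
};

/* Old cleanup: only removes the files. */
static void model_cleanup_old(struct model_trace *bt)
{
    bt->debugfs_files_present = false;
}

/* New cleanup: stops a running trace first, then removes the files. */
static void model_cleanup_new(struct model_trace *bt)
{
    if (bt->state == MODEL_RUNNING)
        bt->state = MODEL_STOPPED;
    bt->debugfs_files_present = false;
}

/* Before the fix: a running trace is detached but never cleaned up. */
static int model_remove_old(struct model_queue *q)
{
    struct model_trace *bt = q->bt;

    q->bt = NULL;
    if (!bt)
        return -1;
    if (bt->state != MODEL_RUNNING)
        model_cleanup_old(bt);
    return 0;
}

/* After the fix: cleanup is unconditional. */
static int model_remove_new(struct model_queue *q)
{
    struct model_trace *bt = q->bt;

    q->bt = NULL;
    if (!bt)
        return -1;
    model_cleanup_new(bt);
    return 0;
}
```

In the old variant a teardown of a running trace leaves the files behind, which corresponds to the "already present" messages on the next setup; the new variant always ends with the trace stopped and the files gone.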
2022-10-20  blktrace: introduce 'blk_trace_{start,stop}' helper  Ye Bin
Introduce 'blk_trace_{start,stop}' helper. No functional change. Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019033602.752383-2-yebin@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-18  Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  Jakub Kicinski
Daniel Borkmann says:

====================
pull-request: bpf-next 2022-10-18

We've added 33 non-merge commits during the last 14 day(s) which contain a total of 31 files changed, 874 insertions(+), 538 deletions(-).

The main changes are:

1) Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs, from Hou Tao & Paul E. McKenney.

2) Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values. In the wild we have seen OS vendors doing buggy backports where helper call numbers mismatched. This is an attempt to make backports more foolproof, from Andrii Nakryiko.

3) Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions, from Roberto Sassu.

4) Fix libbpf's BTF dumper for structs with padding-only fields, from Eduard Zingerman.

5) Fix various libbpf bugs which have been found from fuzzing with malformed BPF object files, from Shung-Hsi Yu.

6) Clean up an unneeded check on existence of SSE2 in BPF x86-64 JIT, from Jie Meng.

7) Fix various ASAN bugs in both libbpf and selftests when running the BPF selftest suite on arm64, from Xu Kuohai.

8) Fix missing bpf_iter_vma_offset__destroy() call in BPF iter selftest and use in-skeleton link pointer to remove an explicit bpf_link__destroy(), from Jiri Olsa.

9) Fix BPF CI breakage by pointing to iptables-legacy instead of relying on symlinked iptables which got upgraded to iptables-nft, from Martin KaFai Lau.

10) Minor BPF selftest improvements all over the place, from various others.

* tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (33 commits)
  bpf/docs: Update README for most recent vmtest.sh
  bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
  bpf: Use rcu_trace_implies_rcu_gp() in local storage map
  bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator
  rcu-tasks: Provide rcu_trace_implies_rcu_gp()
  selftests/bpf: Use sys_pidfd_open() helper when possible
  libbpf: Fix null-pointer dereference in find_prog_by_sec_insn()
  libbpf: Deal with section with no data gracefully
  libbpf: Use elf_getshdrnum() instead of e_shnum
  selftest/bpf: Fix error usage of ASSERT_OK in xdp_adjust_tail.c
  selftests/bpf: Fix error failure of case test_xdp_adjust_tail_grow
  selftest/bpf: Fix memory leak in kprobe_multi_test
  selftests/bpf: Fix memory leak caused by not destroying skeleton
  libbpf: Fix memory leak in parse_usdt_arg()
  libbpf: Fix use-after-free in btf_dump_name_dups
  selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test
  selftests/bpf: Alphabetize DENYLISTs
  selftests/bpf: Add tests for _opts variants of bpf_*_get_fd_by_id()
  libbpf: Introduce bpf_link_get_fd_by_id_opts()
  libbpf: Introduce bpf_btf_get_fd_by_id_opts()
  ...
====================

Link: https://lore.kernel.org/r/20221018210631.11211-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18  kcsan: Instrument memcpy/memset/memmove with newer Clang  Marco Elver
With Clang version 16+, -fsanitize=thread will turn memcpy/memset/memmove calls in instrumented functions into __tsan_memcpy/__tsan_memset/__tsan_memmove calls respectively. Add these functions to the core KCSAN runtime, so that we (a) catch data races with mem* functions, and (b) won't run into linker errors with such newer compilers. Cc: stable@vger.kernel.org # v5.10+ Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  rcutorture: Verify NUM_ACTIVE_RCU_POLL_OLDSTATE  Paul E. McKenney
This commit adds code to the RTWS_POLL_GET case of rcu_torture_writer() to verify that the value of NUM_ACTIVE_RCU_POLL_OLDSTATE is sufficiently large. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  rcutorture: Verify NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE  Paul E. McKenney
This commit adds code to the RTWS_POLL_GET_FULL case of rcu_torture_writer() to verify that the value of NUM_ACTIVE_RCU_POLL_FULL_OLDSTATE is sufficiently large. [ paulmck: Fix whitespace issue located by checkpatch.pl. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  rcu: Fix missing nocb gp wake on rcu_barrier()  Frederic Weisbecker
In preparation for RCU lazy changes, wake up the RCU nocb gp thread if needed after an entrain. This change prevents the RCU barrier callback from waiting in the queue for several seconds before the lazy callbacks in front of it are serviced. Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  rcu: Fix late wakeup when flush of bypass cblist happens  Joel Fernandes (Google)
When the bypass cblist gets too big or its timeout has occurred, it is flushed into the main cblist. However, the bypass timer is still running and the behavior is that it would eventually expire and wake the GP thread. Since we are going to use the bypass cblist for lazy CBs, do the wakeup as soon as the flush for "too big or too long" bypass list happens. Otherwise, long delays can happen for callbacks which get promoted from lazy to non-lazy. This is a good thing to do anyway (regardless of future lazy patches), since it makes the behavior consistent with behavior of other code paths where flushing into the ->cblist makes the GP kthread into a non-sleeping state quickly. [ Frederic Weisbecker: Changes to avoid unnecessary GP-thread wakeups plus comment changes. ] Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  rcu: Simplify rcu_init_nohz() cpumask handling  Zhen Lei
In kernels built with either CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y or CONFIG_NO_HZ_FULL=y, additional CPUs must be added to rcu_nocb_mask. Except that kernels booted without the rcu_nocbs= will not have allocated rcu_nocb_mask. And the current rcu_init_nohz() function uses its need_rcu_nocb_mask and offload_all local variables to track the rcu_nocb and nohz_full state. But there is a much simpler approach, namely creating a cpumask pointer to track the default and then using cpumask_available() to check the rcu_nocb_mask state. This commit takes this approach, thereby simplifying and shortening the rcu_init_nohz() function. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  rcu: Use READ_ONCE() for lockless read of rnp->qsmask  Joel Fernandes (Google)
The rnp->qsmask is locklessly accessed from rcutree_dying_cpu(). This may help avoid load tearing due to concurrent access, KCSAN issues, and preserve sanity of people reading the mask in tracing. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  rcu: Synchronize ->qsmaskinitnext in rcu_boost_kthread_setaffinity()  Pingfan Liu
Once either rcutree_online_cpu() or rcutree_dead_cpu() is invoked concurrently, the following rcu_boost_kthread_setaffinity() race can occur:

        CPU 1                               CPU2
  mask = rcu_rnp_online_cpus(rnp);
  ...
                                      mask = rcu_rnp_online_cpus(rnp);
                                      ...
                                      set_cpus_allowed_ptr(t, cm);
  set_cpus_allowed_ptr(t, cm);

This results in CPU2's update being overwritten by that of CPU1, and thus the possibility of ->boost_kthread_task continuing to run on a to-be-offlined CPU.

This commit therefore eliminates this race by relying on the pre-existing acquisition of ->boost_kthread_mutex to serialize the full process of changing the affinity of ->boost_kthread_task.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  rcu: Remove duplicate RCU exp QS report from rcu_report_dead()  Zqiang
The rcu_report_dead() function invokes rcu_report_exp_rdp() in order to force an immediate expedited quiescent state on the outgoing CPU, and then it invokes rcu_preempt_deferred_qs() to provide any required deferred quiescent state of either sort. Because the call to rcu_preempt_deferred_qs() provides the expedited RCU quiescent state if requested, the call to rcu_report_exp_rdp() is potentially redundant. One possible issue is a concurrent start of a new expedited RCU grace period, but this situation is already handled correctly by __sync_rcu_exp_select_node_cpus(). This function will detect that the CPU is going offline via the error return from its call to smp_call_function_single(). In that case, it will retry, and eventually stop retrying due to rcu_report_exp_rdp() clearing the ->qsmaskinitnext bit corresponding to the target CPU. As a result, __sync_rcu_exp_select_node_cpus() will report the necessary quiescent state after dealing with any remaining CPU. This change assumes that control does not enter rcu_report_dead() within an RCU read-side critical section, but then again, the surviving call to rcu_preempt_deferred_qs() has always made this assumption. This commit therefore removes the call to rcu_report_exp_rdp(), thus relying on rcu_preempt_deferred_qs() to handle both normal and expedited quiescent states. Signed-off-by: Zqiang <qiang1.zhang@intel.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2022-10-18  srcu: Convert ->srcu_lock_count and ->srcu_unlock_count to atomic  Paul E. McKenney
NMI-safe variants of srcu_read_lock() and srcu_read_unlock() are needed by printk(), which on many architectures entails read-modify-write atomic operations. This commit prepares Tree SRCU for this change by making both ->srcu_lock_count and ->srcu_unlock_count be atomic_long_t. [ paulmck: Apply feedback from John Ogness. ] Link: https://lore.kernel.org/all/20220910221947.171557773@linutronix.de/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Ogness <john.ogness@linutronix.de> Cc: Petr Mladek <pmladek@suse.com>
2022-10-18  bpf: Use rcu_trace_implies_rcu_gp() for program array freeing  Hou Tao
To support both sleepable and normal uprobe bpf program, the freeing of trace program array chains a RCU-tasks-trace grace period and a normal RCU grace period one after the other. With the introduction of rcu_trace_implies_rcu_gp(), __bpf_prog_array_free_sleepable_cb() can check whether or not a normal RCU grace period has also passed after a RCU-tasks-trace grace period has passed. If it is true, it is safe to invoke kfree() directly. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221014113946.965131-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18  bpf: Use rcu_trace_implies_rcu_gp() in local storage map  Hou Tao
Local storage map is accessible for both sleepable and non-sleepable bpf program, and its memory is freed by using both call_rcu_tasks_trace() and kfree_rcu() to wait for both RCU-tasks-trace grace period and RCU grace period to pass. With the introduction of rcu_trace_implies_rcu_gp(), both bpf_selem_free_rcu() and bpf_local_storage_free_rcu() can check whether or not a normal RCU grace period has also passed after a RCU-tasks-trace grace period has passed. If it is true, it is safe to call kfree() directly. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221014113946.965131-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18  bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator  Hou Tao
The memory free logic in bpf memory allocator chains an RCU Tasks Trace grace period and a normal RCU grace period one after the other, so it can ensure that both sleepable and non-sleepable programs have finished. With the introduction of rcu_trace_implies_rcu_gp(), __free_rcu_tasks_trace() can check whether or not a normal RCU grace period has also passed after an RCU Tasks Trace grace period has passed. If it is true, free these elements directly; otherwise free them through call_rcu(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20221014113946.965131-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18rcu-tasks: Provide rcu_trace_implies_rcu_gp()Paul E. McKenney
As an accident of implementation, an RCU Tasks Trace grace period also acts as an RCU grace period. However, this could change at any time. This commit therefore creates an rcu_trace_implies_rcu_gp() that currently returns true to codify this accident. Code relying on this accident must call this function to verify that this accident is still happening. Reported-by: Hou Tao <houtao@huaweicloud.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Link: https://lore.kernel.org/r/20221014113946.965131-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
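The pattern that callers of rcu_trace_implies_rcu_gp() follow can be sketched in userspace roughly as below. This is only an illustration under stated assumptions: struct rcu_head, call_rcu() and rcu_trace_implies_rcu_gp() are mocked stand-ins here, and free_now(), free_after_tasks_trace_gp(), trace_implies_gp, frees and deferred are hypothetical names that do not appear in the kernel.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct rcu_head (assumption). */
struct rcu_head {
	void (*func)(struct rcu_head *head);
};

static bool trace_implies_gp = true; /* models the "accident" described above */
static int frees;                    /* objects actually freed */
static int deferred;                 /* frees that went through call_rcu() */

/* Mock: the real function currently returns true, but that may change. */
static bool rcu_trace_implies_rcu_gp(void)
{
	return trace_implies_gp;
}

static void free_now(struct rcu_head *head)
{
	free(head);
	frees++;
}

/* Mock: a real call_rcu() invokes cb only after a normal RCU grace period. */
static void call_rcu(struct rcu_head *head, void (*cb)(struct rcu_head *))
{
	deferred++;
	cb(head);
}

/* Callback invoked once an RCU Tasks Trace grace period has elapsed. */
static void free_after_tasks_trace_gp(struct rcu_head *head)
{
	if (rcu_trace_implies_rcu_gp())
		free_now(head);            /* normal RCU grace period already covered */
	else
		call_rcu(head, free_now);  /* still need to wait for normal RCU */
}
```

The point of routing the decision through the helper is that callers keep working if the implementation detail ever stops holding: they simply fall back to the chained call_rcu() path.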
2022-10-18livepatch: Move the result-invariant calculation out of the loopZhen Lei
The calculation results of the variables 'func_addr' and 'func_size' are not affected by the for loop and do not change due to the changes of entries[i]. The performance can be improved by moving it outside the loop. No functional change. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
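A minimal sketch of this kind of hoisting, assuming a simplified and hypothetical struct func_desc and helper stack_hits_func() (neither is the actual livepatch code): the two values depend only on the function being checked, not on entries[i], so they are computed once before the loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified description of a patched function. */
struct func_desc {
	unsigned long old_addr, old_size;
	unsigned long new_addr, new_size;
	bool patched;
};

/* Returns true if any stack entry lies inside the function's address range. */
static bool stack_hits_func(const struct func_desc *f,
			    const unsigned long *entries, size_t n)
{
	/* Loop-invariant: neither value depends on entries[i]. */
	unsigned long func_addr = f->patched ? f->new_addr : f->old_addr;
	unsigned long func_size = f->patched ? f->new_size : f->old_size;

	for (size_t i = 0; i < n; i++) {
		unsigned long addr = entries[i];

		if (addr >= func_addr && addr < func_addr + func_size)
			return true;
	}
	return false;
}
```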
2022-10-17Merge tag 'cgroup-for-6.1-rc1-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix a recent regression where a sleeping kernfs function is called with css_set_lock (spinlock) held

 - Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file()

   Multiple users assume that the lookup only works for cgroup2 and breaks when fed a cgroup1 file. Instead, introduce a separate set of functions to lookup both v1 and v2 and use them where the user explicitly wants to support both versions.

 - Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c.

 - Add Josef Bacik as a blkcg maintainer.

* tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: Update MAINTAINERS entry
  mm: cgroup: fix comments for get from fd/file helpers
  perf stat: Support old kernels for bperf cgroup counting
  bpf: cgroup_iter: support cgroup1 using cgroup fd
  cgroup: add cgroup_v1v2_get_from_[fd/file]()
  Revert "cgroup: enable cgroup_get_from_file() on cgroup1"
  cgroup: Reorganize css_set_lock and kernfs path processing
2022-10-17audit: unify audit_filter_{uring(), inode_name(), syscall()}Ankur Arora
audit_filter_uring(), audit_filter_inode_name() are substantially similar to audit_filter_syscall(). Move the core logic to __audit_filter_op() which can be parametrized for all three.

On a Skylakex system, getpid() latency (all results aggregated across 12 boot cycles):

         Min      Mean     Median   Max      pstdev
         (ns)     (ns)     (ns)     (ns)
    -    196.63   207.86   206.60   230.98   (+- 3.92%)
    +    183.73   196.95   192.31   232.49   (+- 6.04%)

Performance counter stats for 'bin/getpid' (3 runs) go from:

    cycles                   805.58  ( +-  4.11% )
    instructions            1654.11  ( +-   .05% )
    IPC                        2.06  ( +-  3.39% )
    branches                 430.02  ( +-   .05% )
    branch-misses              1.55  ( +-  7.09% )
    L1-dcache-loads          440.01  ( +-   .09% )
    L1-dcache-load-misses      9.05  ( +- 74.03% )

to:

    cycles                   765.37  ( +-  6.66% )
    instructions            1677.07  ( +-  0.04% )
    IPC                        2.20  ( +-  5.90% )
    branches                 431.10  ( +-  0.04% )
    branch-misses              1.60  ( +- 11.25% )
    L1-dcache-loads          521.04  ( +-  0.05% )
    L1-dcache-load-misses      6.92  ( +- 77.60% )

(Both aggregated over 12 boot cycles.)

The increased L1-dcache-loads are due to some intermediate values now coming from the stack.

The improvement in cycles is due to a slightly denser loop (the list parameter in the list_for_each_entry_rcu() exit check now comes from a register rather than a constant as before.)

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-10-17audit: cache ctx->major in audit_filter_syscall()Ankur Arora
ctx->major contains the current syscall number. This is, of course, a constant for the duration of the syscall. Unfortunately, GCC's alias analysis cannot prove that it is not modified via a pointer in the audit_filter_syscall() loop, and so always loads it from memory.

In and of itself the load isn't very expensive (ops dependent on the ctx->major load are only used to determine the direction of control flow and have short dependence chains and, in any case the related branches get predicted perfectly in the fastpath) but still cache ctx->major in a local for two reasons:

* ctx->major is in the first cacheline of struct audit_context and has similar alignment as audit_entry::list audit_entry. For cases with a lot of audit rules, doing this reduces one source of contention from a potentially busy cache-set.

* audit_in_mask() (called in the hot loop in audit_filter_syscall()) does cast manipulation and error checking on ctx->major:

    audit_in_mask(const struct audit_krule *rule, unsigned long val):
            if (val > 0xffffffff)
                    return false;

            word = AUDIT_WORD(val);
            if (word >= AUDIT_BITMASK_SIZE)
                    return false;

            bit = AUDIT_BIT(val);

            return rule->mask[word] & bit;

  The clauses related to the rule need to be evaluated in the loop, but the rest is unnecessarily re-evaluated for every loop iteration. (Note, however, that most of these are cheap ALU ops and the branches are perfectly predicted. However, see discussion on cycles improvement below for more on why it is still worth hoisting.)

On a Skylakex system change in getpid() latency (aggregated over 12 boot cycles):

         Min      Mean     Median   Max      pstdev
         (ns)     (ns)     (ns)     (ns)
    -    201.30   216.14   216.22   228.46   (+- 1.45%)
    +    196.63   207.86   206.60   230.98   (+- 3.92%)

Performance counter stats for 'bin/getpid' (3 runs) go from:

    cycles                   836.89  ( +-   .80% )
    instructions            2000.19  ( +-   .03% )
    IPC                        2.39  ( +-   .83% )
    branches                 430.14  ( +-   .03% )
    branch-misses              1.48  ( +-  3.37% )
    L1-dcache-loads          471.11  ( +-   .05% )
    L1-dcache-load-misses      7.62  ( +- 46.98% )

to:

    cycles                   805.58  ( +-  4.11% )
    instructions            1654.11  ( +-   .05% )
    IPC                        2.06  ( +-  3.39% )
    branches                 430.02  ( +-   .05% )
    branch-misses              1.55  ( +-  7.09% )
    L1-dcache-loads          440.01  ( +-   .09% )
    L1-dcache-load-misses      9.05  ( +- 74.03% )

(Both aggregated over 12 boot cycles.)

instructions: we reduce around 8 instructions/iteration because some of the computation is now hoisted out of the loop (branch count does not change because GCC, for reasons unclear, only hoists the computations while keeping the basic-blocks.)

cycles: improve by about 5% (in aggregate and looking at individual run numbers.) This is likely because we now waste fewer pipeline resources on unnecessary instructions which allows the control flow to speculatively execute further ahead shortening the execution of the loop a little. The final gating factor on the performance of this loop remains the long dependence chain due to the linked-list load.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
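The hoisting described in this message can be sketched as follows. This is a userspace illustration modeled on the audit_in_mask() pseudocode quoted above, not the kernel implementation: SKETCH_BITMASK_SIZE, SKETCH_WORD(), SKETCH_BIT(), struct rule and first_matching_rule() are hypothetical names, and the constants are assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BITMASK_SIZE 64                     /* assumed mask length */
#define SKETCH_WORD(nr) ((uint32_t)((nr) / 32))
#define SKETCH_BIT(nr)  (1U << ((nr) % 32))

/* Hypothetical stand-in for a rule carrying a syscall bitmask. */
struct rule {
	uint32_t mask[SKETCH_BITMASK_SIZE];
};

/* Returns the index of the first rule whose mask contains val, or -1. */
static int first_matching_rule(const struct rule *rules, size_t n,
			       unsigned long val)
{
	uint32_t word, bit;

	/* Rule-independent checks: evaluated once instead of per iteration. */
	if (val > 0xffffffffUL)
		return -1;
	word = SKETCH_WORD(val);
	if (word >= SKETCH_BITMASK_SIZE)
		return -1;
	bit = SKETCH_BIT(val);

	/* Only the rule-dependent test stays inside the loop. */
	for (size_t i = 0; i < n; i++) {
		if (rules[i].mask[word] & bit)
			return (int)i;
	}
	return -1;
}
```

Because val is a by-value local here, the compiler no longer has to assume it might change through a pointer between iterations, which is the same reason the commit caches ctx->major in a local.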
2022-10-17bpf: prevent decl_tag from being referenced in func_protoStanislav Fomichev
Syzkaller was able to hit the following issue:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 3609 at kernel/bpf/btf.c:1946 btf_type_id_size+0x2d5/0x9d0 kernel/bpf/btf.c:1946
Modules linked in:
CPU: 0 PID: 3609 Comm: syz-executor361 Not tainted 6.0.0-syzkaller-02734-g0326074ff465 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
RIP: 0010:btf_type_id_size+0x2d5/0x9d0 kernel/bpf/btf.c:1946
Code: ef e8 7f 8e e4 ff 41 83 ff 0b 77 28 f6 44 24 10 18 75 3f e8 6d 91 e4 ff 44 89 fe bf 0e 00 00 00 e8 20 8e e4 ff e8 5b 91 e4 ff <0f> 0b 45 31 f6 e9 98 02 00 00 41 83 ff 12 74 18 e8 46 91 e4 ff 44
RSP: 0018:ffffc90003cefb40 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
RDX: ffff8880259c0000 RSI: ffffffff81968415 RDI: 0000000000000005
RBP: ffff88801270ca00 R08: 0000000000000005 R09: 000000000000000e
R10: 0000000000000011 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000011 R14: ffff888026ee6424 R15: 0000000000000011
FS:  000055555641b300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000f2e258 CR3: 000000007110e000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 btf_func_proto_check kernel/bpf/btf.c:4447 [inline]
 btf_check_all_types kernel/bpf/btf.c:4723 [inline]
 btf_parse_type_sec kernel/bpf/btf.c:4752 [inline]
 btf_parse kernel/bpf/btf.c:5026 [inline]
 btf_new_fd+0x1926/0x1e70 kernel/bpf/btf.c:6892
 bpf_btf_load kernel/bpf/syscall.c:4324 [inline]
 __sys_bpf+0xb7d/0x4cf0 kernel/bpf/syscall.c:5010
 __do_sys_bpf kernel/bpf/syscall.c:5069 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5067 [inline]
 __x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:5067
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f0fbae41c69
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc8aeb6228 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0fbae41c69
RDX: 0000000000000020 RSI: 0000000020000140 RDI: 0000000000000012
RBP: 00007f0fbae05e10 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f0fbae05ea0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

It looks like it tries to create a func_proto whose return type is decl_tag. For the details, see Martin's spot-on analysis in [0].

0: https://lore.kernel.org/bpf/CAKH8qBuQDLva_hHxxBuZzyAcYNO4ejhovz6TQeVSk8HY-2SO6g@mail.gmail.com/T/#mea6524b3fcd6298347432226e81b1e6155efc62c

Cc: Yonghong Song <yhs@fb.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Fixes: bd16dee66ae4 ("bpf: Add BTF_KIND_DECL_TAG typedef support")
Reported-by: syzbot+d8bd751aef7c6b39a344@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221015002444.2680969-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-10-17sched: Introduce struct balance_callback to avoid CFI mismatchesKees Cook
Introduce distinct struct balance_callback instead of performing function pointer casting which will trip CFI. Avoids warnings as found by Clang's future -Wcast-function-type-strict option: In file included from kernel/sched/core.c:84: kernel/sched/sched.h:1755:15: warning: cast from 'void (*)(struct rq *)' to 'void (*)(struct callback_head *)' converts to incompatible function type [-Wcast-function-type-strict] head->func = (void (*)(struct callback_head *))func; ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ No binary differences result from this change. This patch is a cleanup based on Brad Spengler/PaX Team's modifications to sched code in their last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Reported-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/1724 Link: https://lkml.kernel.org/r/20221008000758.2957718-1-keescook@chromium.org
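A minimal userspace sketch of the idea, assuming a simplified struct rq: the callback node gets its own type whose function pointer already has the right signature, so no cast between incompatible function types is needed. The helpers sketch_queue(), sketch_run() and bump() are hypothetical and simplified; they are not the scheduler's actual functions.

```c
#include <stddef.h>

/* Simplified stand-in for the scheduler's run queue (assumption). */
struct rq {
	int nr_running;
};

/* Dedicated node type: func takes a struct rq *, so assignment needs no cast. */
struct balance_callback {
	struct balance_callback *next;
	void (*func)(struct rq *rq);
};

/* Push a callback onto a singly linked list. */
static void sketch_queue(struct balance_callback **list,
			 struct balance_callback *head,
			 void (*func)(struct rq *rq))
{
	head->func = func;   /* types match exactly, CFI-friendly */
	head->next = *list;
	*list = head;
}

/* Detach the list and invoke every queued callback on rq. */
static void sketch_run(struct balance_callback **list, struct rq *rq)
{
	struct balance_callback *head = *list;

	*list = NULL;
	while (head) {
		struct balance_callback *next = head->next;

		head->next = NULL;
		head->func(rq);
		head = next;
	}
}

/* Example callback used for demonstration only. */
static void bump(struct rq *rq)
{
	rq->nr_running++;
}
```

With a generic callback_head, the assignment would need a cast from void (*)(struct rq *) to void (*)(struct callback_head *), which is exactly what the quoted warning flags.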
2022-10-17sched/core: Fix comparison in sched_group_cookie_match()Lin Shengwang
In commit 97886d9dcd86 ("sched: Migration changes for core scheduling"), sched_group_cookie_match() was added to help determine if a cookie matches the core state. However, while it iterates the SMT group, it fails to actually use the RQ for each of the CPUs iterated. Use cpu_rq(cpu) instead of rq to fix this. Fixes: 97886d9dcd86 ("sched: Migration changes for core scheduling") Signed-off-by: Lin Shengwang <linshengwang1@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221008022709.642-1-linshengwang1@huawei.com
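The class of bug fixed here can be illustrated with a small userspace sketch. Everything below is a simplified assumption: the runqueues array, the cookie_match() predicate and the group_match_buggy()/group_match_fixed() helpers are hypothetical, and the real cookie check involves more state than a single equality test.

```c
#include <stdbool.h>

#define SKETCH_NR_CPUS 4

/* Simplified per-CPU run queue carrying only a core cookie (assumption). */
struct rq {
	unsigned long core_cookie;
};

static struct rq runqueues[SKETCH_NR_CPUS];

static struct rq *cpu_rq(int cpu)
{
	return &runqueues[cpu];
}

static bool cookie_match(const struct rq *rq, unsigned long cookie)
{
	return rq->core_cookie == cookie;
}

/* Buggy shape: iterates the group but tests the caller's rq every time. */
static bool group_match_buggy(const struct rq *rq, const int *cpus, int n,
			      unsigned long cookie)
{
	(void)cpus; /* the iterated CPU is never consulted: that is the bug */
	for (int i = 0; i < n; i++) {
		if (cookie_match(rq, cookie))
			return true;
	}
	return false;
}

/* Fixed shape: looks up the run queue of each iterated CPU. */
static bool group_match_fixed(const int *cpus, int n, unsigned long cookie)
{
	for (int i = 0; i < n; i++) {
		if (cookie_match(cpu_rq(cpus[i]), cookie))
			return true;
	}
	return false;
}
```

In the buggy shape the result depends only on the run queue passed in, so a group that does not contain a matching CPU can still be reported as matching, and a group that does can be missed.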
2022-10-17x86/ftrace: Make it call depth tracking awarePeter Zijlstra
Since ftrace has trampolines, don't use thunks for the __fentry__ site but instead require that every function called from there includes accounting. This very much includes all the direct-call functions. Additionally, ftrace uses ROP tricks in two places: - return_to_handler(), and - ftrace_regs_caller() when pt_regs->orig_ax is set by a direct-call. return_to_handler() already uses a retpoline to replace an indirect-jump to defeat IBT. Since this is a jump-type retpoline, make sure there is no accounting done and ALTERNATIVE the RET into a ret. ftrace_regs_caller() does much the same and gets the same treatment. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111148.927545073@infradead.org
2022-10-17kallsyms: Take callthunks into accountPeter Zijlstra
Since the pre-symbol function padding is an integral part of the symbol, make kallsyms report it as part of the symbol by reporting it as sym-x instead of prev_sym+y. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111148.409656012@infradead.org
2022-10-17static_call: Add call depth tracking supportPeter Zijlstra
When indirect calls are switched to direct calls then it has to be ensured that the call target is not the function, but the call thunk when call depth tracking is enabled. But static calls are available before call thunks have been set up. Ensure a second run through the static call patching code after call thunks have been created. When call thunks are not enabled this has no side effects. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20220915111148.306100465@infradead.org
2022-10-17bpf: Fix sample_flags for bpf_perf_event_outputSumanth Korikkar
* Raw data is also filled by bpf_perf_event_output.
* Add sample_flags to indicate raw data.
* This eliminates the segfaults as shown below:

  Run ./samples/bpf/trace_output
  BUG pid 9 cookie 1001000000004 sized 4
  BUG pid 9 cookie 1001000000004 sized 4
  BUG pid 9 cookie 1001000000004 sized 4
  Segmentation fault (core dumped)

Fixes: 838d9bb62d13 ("perf: Use sample_flags for raw_data")
Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lkml.kernel.org/r/20221007081327.1047552-1-sumanthk@linux.ibm.com
2022-10-17perf: Fix missing SIGTRAPsPeter Zijlstra
Marco reported:

Due to the implementation of how SIGTRAP are delivered if perf_event_attr::sigtrap is set, we've noticed 3 issues:

 1. Missing SIGTRAP due to a race with event_sched_out() (more details below).

 2. Hardware PMU events being disabled due to returning 1 from perf_event_overflow(). The only way to re-enable the event is for user space to first "properly" disable the event and then re-enable it.

 3. The inability to automatically disable an event after a specified number of overflows via PERF_EVENT_IOC_REFRESH.

The worst of the 3 issues is problem (1), which occurs when a pending_disable is "consumed" by a racing event_sched_out(), observed as follows:

    CPU0                            | CPU1
    --------------------------------+---------------------------
    __perf_event_overflow()         |
     perf_event_disable_inatomic()  |
      pending_disable = CPU0        | ...
                                    | _perf_event_enable()
                                    |  event_function_call()
                                    |   task_function_call()
                                    |    /* sends IPI to CPU0 */
    <IPI>                           | ...
     __perf_event_enable()          +---------------------------
      ctx_resched()
       task_ctx_sched_out()
        ctx_sched_out()
         group_sched_out()
          event_sched_out()
           pending_disable = -1
    </IPI>
    <IRQ-work>
     perf_pending_event()
      perf_pending_event_disable()
      /* Fails to send SIGTRAP because no pending_disable! */
    </IRQ-work>

In the above case, not only is that particular SIGTRAP missed, but also all future SIGTRAPs because 'event_limit' is not reset back to 1.

To fix, rework pending delivery of SIGTRAP via IRQ-work by introduction of a separate 'pending_sigtrap', no longer using 'event_limit' and 'pending_disable' for its delivery.

Additionally; and different to Marco's proposed patch:

 - recognise that pending_disable effectively duplicates oncpu for the case where it is set. As such, change the irq_work handler to use ->oncpu to target the event and use pending_* as boolean toggles.

 - observe that SIGTRAP targets the ctx->task, so the context switch optimization that carries contexts between tasks is invalid. If the irq_work were delayed enough to hit after a context switch the SIGTRAP would be delivered to the wrong task.

 - observe that if the event gets scheduled out (rotation/migration/context-switch/...) the irq-work would be insufficient to deliver the SIGTRAP when the event gets scheduled back in (the irq-work might still be pending on the old CPU).

   Therefore have event_sched_out() convert the pending sigtrap into a task_work which will deliver the signal at return_to_user.

Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Debugged-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Marco Elver <elver@google.com>
Debugged-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Tested-by: Marco Elver <elver@google.com>
2022-10-17timers: Replace in_irq() with in_hardirq()ye xingchen
Replace the obsolete and ambiguous macro in_irq() with new macro in_hardirq(). Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20221012012629.334966-1-ye.xingchen@zte.com.cn
2022-10-16Merge tag 'random-6.1-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull more random number generator updates from Jason Donenfeld:

"This time with some large scale treewide cleanups.

The intent of this pull is to clean up the way callers fetch random integers. The current rules for doing this right are:

 - If you want a secure or an insecure random u64, use get_random_u64()

 - If you want a secure or an insecure random u32, use get_random_u32()

   The old function prandom_u32() has been deprecated for a while now and is just a wrapper around get_random_u32(). Same for get_random_int().

 - If you want a secure or an insecure random u16, use get_random_u16()

 - If you want a secure or an insecure random u8, use get_random_u8()

 - If you want secure or insecure random bytes, use get_random_bytes().

   The old function prandom_bytes() has been deprecated for a while now and has long been a wrapper around get_random_bytes()

 - If you want a non-uniform random u32, u16, or u8 bounded by a certain open interval maximum, use prandom_u32_max()

   I say "non-uniform", because it doesn't do any rejection sampling or divisions. Hence, it stays within the prandom_*() namespace, not the get_random_*() namespace.

   I'm currently investigating a "uniform" function for 6.2. We'll see what comes of that.

By applying these rules uniformly, we get several benefits:

 - By using prandom_u32_max() with an upper-bound that the compiler can prove at compile-time is ≤65536 or ≤256, internally get_random_u16() or get_random_u8() is used, which wastes fewer batched random bytes, and hence has higher throughput.

 - By using prandom_u32_max() instead of %, when the upper-bound is not a constant, division is still avoided, because prandom_u32_max() uses a faster multiplication-based trick instead.

 - By using get_random_u16() or get_random_u8() in cases where the return value is intended to indeed be a u16 or a u8, we waste fewer batched random bytes, and hence have higher throughput.

This series was originally done by hand while I was on an airplane without Internet. Later, Kees and I worked on retroactively figuring out what could be done with Coccinelle and what had to be done manually, and then we split things up based on that.

So while this touches a lot of files, the actual amount of code that's hand fiddled is comfortably small"

* tag 'random-6.1-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random:
  prandom: remove unused functions
  treewide: use get_random_bytes() when possible
  treewide: use get_random_u32() when possible
  treewide: use get_random_{u8,u16}() when possible, part 2
  treewide: use get_random_{u8,u16}() when possible, part 1
  treewide: use prandom_u32_max() when possible, part 2
  treewide: use prandom_u32_max() when possible, part 1
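The "multiplication-based trick" mentioned in the pull message can be sketched as below. This is a userspace illustration, not the kernel's implementation: bounded_u32() is a hypothetical name, and the random input is passed in as a parameter (rather than drawn from a generator) so the example stays deterministic.

```c
#include <stdint.h>

/*
 * Map a 32-bit random value rnd into the half-open range [0, ceil) without
 * a division: take the high 32 bits of the 64-bit product rnd * ceil.
 * No rejection sampling is done, so the result is slightly non-uniform
 * whenever ceil does not divide 2^32.
 */
static inline uint32_t bounded_u32(uint32_t rnd, uint32_t ceil)
{
	return (uint32_t)(((uint64_t)rnd * ceil) >> 32);
}
```

Compared with rnd % ceil, this replaces a division by a multiply and a shift, which is why the bound does not need to be a compile-time constant for the operation to stay cheap.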
2022-10-14Merge tag 'sched-psi-2022-10-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull PSI updates from Ingo Molnar:

 - Various performance optimizations, resulting in a 4%-9% speedup in the mmtests/config-scheduler-perfpipe micro-benchmark.

 - New interface to turn PSI on/off on a per cgroup level.

* tag 'sched-psi-2022-10-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/psi: Per-cgroup PSI accounting disable/re-enable interface
  sched/psi: Cache parent psi_group to speed up group iteration
  sched/psi: Consolidate cgroup_psi()
  sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
  sched/psi: Remove NR_ONCPU task accounting
  sched/psi: Optimize task switch inside shared cgroups again
  sched/psi: Move private helpers to sched/stats.h
  sched/psi: Save percpu memory when !psi_cgroups_enabled
  sched/psi: Don't create cgroup PSI files when psi_disabled
  sched/psi: Fix periodic aggregation shut off