summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2024-04-15rcu-tasks: Replace exit_tasks_rcu_start() initialization with WARN_ON_ONCE()Paul E. McKenney
Because the Tasks RCU ->rtp_exit_list is initialized at rcu_init() time while there is only one CPU running with interrupts disabled, it is not possible for an exiting task to encounter an uninitialized list. This commit therefore replaces the conditional initialization with a WARN_ON_ONCE(). Reported-by: Frederic Weisbecker <frederic@kernel.org> Closes: https://lore.kernel.org/all/ZdiNXmO3wRvmzPsr@lothringen/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Inform KCSAN of one-byte cmpxchg() in rcu_trc_cmpxchg_need_qs()Paul E. McKenney
Tasks Trace RCU needs a single-byte cmpxchg(), but no such thing exists. Therefore, rcu_trc_cmpxchg_need_qs() emulates one using field substitution and a four-byte cmpxchg(), such that the other three bytes are always atomically updated to their old values. This works, but results in false-positive KCSAN failures because as far as KCSAN knows, this cmpxchg() operation is updating all four bytes. This commit therefore encloses the cmpxchg() in a data_race() and adds a single-byte instrument_atomic_read_write(), thus telling KCSAN exactly what is going on so as to avoid the false positives. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Marco Elver <elver@google.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Make hotplug operations track GP state, not flagsPaul E. McKenney
Currently, there are rcu_data structure fields named ->rcu_onl_gp_seq and ->rcu_ofl_gp_seq that track the rcu_state.gp_flags field at the time of the corresponding CPU's last online or offline operation, respectively. However, this information is not particularly useful. It would be better to instead track the grace period state kept in rcu_state.gp_state. This would also be consistent with the initialization in rcu_boot_init_percpu_data(), which is to RCU_GP_CLEANED (an rcu_state.gp_state value), and also with the diagnostics in rcu_implicit_dynticks_qs(), whose format is consistent with an integer, not a bitmask. This commit therefore makes this change and changes the names to ->rcu_onl_gp_flags and ->rcu_ofl_gp_flags, respectively. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Mark loads from rcu_state.n_online_cpusPaul E. McKenney
The rcu_state.n_online_cpus value is only ever updated by CPU-hotplug operations, which are serialized. However, this value is read locklessly. This commit therefore marks those reads. While in the area, it also adds ASSERT_EXCLUSIVE_WRITER() calls just in case parallel CPU hotplug becomes a thing. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Mark writes to rcu_sync ->gp_count fieldPaul E. McKenney
The rcu_sync structure's ->gp_count field is updated under the protection of ->rss_lock, but read locklessly, and KCSAN noted the data race. This commit therefore uses WRITE_ONCE() to do this update to clearly document its racy nature. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Bring diagnostic read of rcu_state.gp_flags into alignmentPaul E. McKenney
This commit adds READ_ONCE() to a lockless diagnostic read from rcu_state.gp_flags to avoid giving the compiler any chance whatsoever of confusing the diagnostic state printed. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Remove redundant READ_ONCE() of rcu_state.gp_flags in tree.cPaul E. McKenney
Although it is functionally OK to do READ_ONCE() of a variable that cannot change, it is confusing and at best an accident waiting to happen. This commit therefore removes a number of READ_ONCE(rcu_state.gp_flags) instances from kernel/rcu/tree.c that are not needed due to updates to this field being excluded by virtue of holding the root rcu_node structure's ->lock. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Closes: https://lore.kernel.org/lkml/4857c5ef-bd8f-4670-87ac-0600a1699d05@paulmck-laptop/T/#mccb23c2a4902da4d3c750165329f8de056903c58 Reported-by: Julia Lawall <julia.lawall@inria.fr> Closes: https://lore.kernel.org/lkml/4857c5ef-bd8f-4670-87ac-0600a1699d05@paulmck-laptop/T/#md1b5c026584f9c3c7b0fbc9240dd7de584597b73 Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Make Tiny RCU explicitly disable preemptionPaul E. McKenney
Because Tiny RCU is used only in kernels built with either CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, there has not been any need for TINY RCU to explicitly disable preemption. However, the prospect of lazy preemption changes that, and preemption means that the non-atomic increment in synchronize_rcu() can be preempted, with the possibility that one of the increments is lost. This could cause failures for users of the APIs that poll RCU grace periods. This commit therefore adds the needed preempt_disable() and preempt_enable() call to Tiny RCU. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Remove redundant BH disabling in TINY_RCUPaul E. McKenney
The TINY_RCU rcu_process_callbacks() function is only ever invoked from a softirq handler, which means that BH is already disabled. This commit therefore removes the redundant local_bh_disable() and local_bh_ennable() from this function. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Create NEED_TASKS_RCU to factor out enablement logicPaul E. McKenney
Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does "select TASKS_RCU if PREEMPTION". This works, but requires any change in this enablement logic to be replicated across all such "select" clauses. This commit therefore creates a new NEED_TASKS_RCU Kconfig option so that the default value of TASKS_RCU can depend on a combination of this new option and any needed enablement logic, so that this logic is in one place. While in the area, also anticipate a likely future change by adding PREEMPT_AUTO to that logic. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15srcu: Make Tiny SRCU explicitly disable preemptionPaul E. McKenney
Because Tiny SRCU is used only in kernels built with either CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, there has not been any need for TINY SRCU to explicitly disable preemption. However, the prospect of lazy preemption changes that, and the lazy-preemption patches do result in rcutorture runs finding both too-short grace periods and grace-period hangs for Tiny SRCU. This commit therefore adds the needed preempt_disable() and preempt_enable() calls to Tiny SRCU. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-15rcu: Make TINY_RCU depend on !PREEMPT_RCU rather than !PREEMPTIONPaul E. McKenney
Right now, TINY_RCU depends on (!PREEMPTION && !SMP), which has served the kernel well for many years due to the fact that PREEMPT_RCU is normally a synonym for PREEMPTION. But with the advent of lazy preemption, it will be possible to have non-preemptible RCU in a preemptible kernel, so that kernels could be built with PREEMPT_RCU=n and PREEMPTION=y. This commit therefore makes TINY_RCU depend on (!PREEMPT_RCU && !SMP), thus allowing for a non-preemptible RCU in preemptible kernels. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-14perf/bpf: Mark perf_event_set_bpf_handler() and ↵Ingo Molnar
perf_event_free_bpf_handler() as inline too They can be unused with certain Kconfig variations: kernel/events/core.c:9622:13: warning: ‘perf_event_free_bpf_handler’ defined but not used [-Wunused-function] kernel/events/core.c:9586:12: warning: ‘perf_event_set_bpf_handler’ defined but not used [-Wunused-function] Since they are both single-use, mark them inline. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org Cc: Kyle Huey <khuey@kylehuey.com>
2024-04-14perf/ring_buffer: Trigger IO signals for watermark_wakeupKyle Huey
perf_output_wakeup() already marks the perf event fd available for polling. Trigger IO signals with FASYNC too. Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240413141618.4160-3-khuey@kylehuey.com
2024-04-14perf: Move perf_event_fasync() to perf_event.hKyle Huey
This will allow it to be called from perf_output_wakeup(). Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240413141618.4160-2-khuey@kylehuey.com
2024-04-14Merge branch 'linus' into perf/core, to pick up perf/urgent fixesIngo Molnar
Pick up perf/urgent fixes that are upstream already, but not yet in the perf/core development branch. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-04-14Merge tag 'x86-urgent-2024-04-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 fixes from Ingo Molnar: - Follow up fixes for the BHI mitigations code - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as expected - Work around an APIC emulation bug when the kernel is built with Clang and run as a SEV guest - Follow up x86 topology fixes * tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Move TOPOEXT enablement into the topology parser x86/cpu/amd: Make the NODEID_MSR union actually work x86/cpu/amd: Make the CPUID 0x80000008 parser correct x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto x86/bugs: Clarify that syscall hardening isn't a BHI mitigation x86/bugs: Fix BHI handling of RRSBA x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr' x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES x86/bugs: Fix BHI documentation x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n x86/topology: Don't update cpu_possible_map in topo_set_cpuids() x86/bugs: Fix return type of spectre_bhi_state() x86/apic: Force native_apic_mem_read() to use the MOV instruction
2024-04-14Merge tag 'timers-urgent-2024-04-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: - Address a (valid) W=1 build warning - Fix timer self-tests - Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu global variable - Address a !CONFIG_BUG build warning * tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: selftests: kselftest: Fix build failure with NOLIBC selftests: timers: Fix abs() warning in posix_timers test selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn selftests: timers: Fix posix_timers ksft_print_msg() warning selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior bug: Fix no-return-statement warning with !CONFIG_BUG timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu selftests/timers/posix_timers: Reimplement check_timer_distribution() irqflags: Explicitly ignore lockdep_hrtimer_exit() argument
2024-04-14Merge tag 'dma-maping-6.9-2024-04-14' of ↵Linus Torvalds
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - fix up swiotlb buffer padding even more (Petr Tesarik) - fix for partial dma_sync on swiotlb (Michael Kelley) - swiotlb debugfs fix (Dexuan Cui) * tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files() swiotlb: fix swiotlb_bounce() to do partial sync's correctly swiotlb: extend buffer pre-padding to alloc_align_mask if necessary
2024-04-12bpf: Fix a verifier verbose messageAnton Protopopov
Long ago a map file descriptor in a pseudo ldimm64 instruction could only be present as an immediate value insn[0].imm, and thus this value was used in a verbose verifier message printed when the file descriptor wasn't valid. Since addition of BPF_PSEUDO_MAP_IDX_VALUE/BPF_PSEUDO_MAP_IDX the insn[0].imm field can also contain an index pointing to the file descriptor in the attr.fd_array array. However, if the file descriptor is invalid, the verifier still prints the verbose message containing value of insn[0].imm. Patch the verifier message to always print the actual file descriptor value. Fixes: 387544bfa291 ("bpf: Introduce fd_idx") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240412141100.3562942-1-aspsk@isovalent.com
2024-04-12Merge tag 'trace-v6.9-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the buffer_percent accounting as it is dependent on three variables: 1) pages_read - number of subbuffers read 2) pages_lost - number of subbuffers lost due to overwrite 3) pages_touched - number of pages that a writer entered These three counters only increment, and to know how many active pages there are on the buffer at any given time, the pages_read and pages_lost are subtracted from pages_touched. But the pages touched was incremented whenever any writer went to the next subbuffer even if it wasn't the only one, so it was incremented more than it should be causing the counter for how many subbuffers currently have content incorrect, which caused the buffer_percent that holds waiters until the ring buffer is filled to a given percentage to wake up early. - Fix warning of unused functions when PERF_EVENTS is not configured in - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE - Fix to some kerneldoc function comments in eventfs code. * tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Only update pages_touched when a new page is touched tracing: hide unused ftrace_event_id_fops tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry eventfs: Fix kernel-doc comments to functions
2024-04-12watchdog/softlockup: Report the most frequent interruptsBitao Hu
When the watchdog determines that the current soft lockup is due to an interrupt storm based on CPU utilization, reporting the most frequent interrupts could be good enough for further troubleshooting. Below is an example of interrupt storm. The call tree does not provide useful information, but analyzing which interrupt caused the soft lockup by comparing the counts of interrupts during the lockup period allows to identify the culprit. [ 638.870231] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [swapper/9:0] [ 638.870825] CPU#9 Utilization every 4s during lockup: [ 638.871194] #1: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.871652] #2: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.872107] #3: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.872563] #4: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.873018] #5: 0% system, 0% softirq, 100% hardirq, 0% idle [ 638.873494] CPU#9 Detect HardIRQ Time exceeds 50%. Most frequent HardIRQs: [ 638.873994] #1: 330945 irq#7 [ 638.874236] #2: 31 irq#82 [ 638.874493] #3: 10 irq#10 [ 638.874744] #4: 2 irq#89 [ 638.874992] #5: 1 irq#102 ... [ 638.875313] Call trace: [ 638.875315] __do_softirq+0xa8/0x364 Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20240411074134.30922-6-yaoma@linux.alibaba.com
2024-04-12watchdog/softlockup: Low-overhead detection of interrupt stormBitao Hu
The following softlockup is caused by interrupt storm, but it cannot be identified from the call tree. Because the call tree is just a snapshot and doesn't fully capture the behavior of the CPU during the soft lockup. watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921] ... Call trace: __do_softirq+0xa0/0x37c __irq_exit_rcu+0x108/0x140 irq_exit+0x14/0x20 __handle_domain_irq+0x84/0xe0 gic_handle_irq+0x80/0x108 el0_irq_naked+0x50/0x58 Therefore, it is necessary to report CPU utilization during the softlockup_threshold period (report once every sample_period, for a total of 5 reportings), like this: watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921] CPU#28 Utilization every 4s during lockup: #1: 0% system, 0% softirq, 100% hardirq, 0% idle #2: 0% system, 0% softirq, 100% hardirq, 0% idle #3: 0% system, 0% softirq, 100% hardirq, 0% idle #4: 0% system, 0% softirq, 100% hardirq, 0% idle #5: 0% system, 0% softirq, 100% hardirq, 0% idle ... This is helpful in determining whether an interrupt storm has occurred or in identifying the cause of the softlockup. The criteria for determination are as follows: a. If the hardirq utilization is high, then interrupt storm should be considered and the root cause cannot be determined from the call tree. b. If the softirq utilization is high, then the call might not necessarily point at the root cause. c. If the system utilization is high, then analyzing the root cause from the call tree is possible in most cases. The mechanism requires a considerable amount of global storage space when configured for the maximum number of CPUs. Therefore, adding a SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes" if the max number of CPUs is <= 128. Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Link: https://lore.kernel.org/r/20240411074134.30922-5-yaoma@linux.alibaba.com
2024-04-12genirq: Avoid summation loops for /proc/interruptsBitao Hu
show_interrupts() unconditionally accumulates the per CPU interrupt statistics to determine whether an interrupt was ever raised. This can be avoided for all interrupts which are not strictly per CPU and not of type NMI because those interrupts provide already an accumulated counter. The required logic is already implemented in kstat_irqs(). Split the inner access logic out of kstat_irqs() and use it for kstat_irqs() and show_interrupts() to avoid the accumulation loop when possible. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Liu Song <liusong@linux.alibaba.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20240411074134.30922-4-yaoma@linux.alibaba.com
2024-04-12genirq: Provide a snapshot mechanism for interrupt statisticsBitao Hu
The soft lockup detector lacks a mechanism to identify interrupt storms as root cause of a lockup. To enable this the detector needs a mechanism to snapshot the interrupt count statistics on a CPU when the detector observes a potential lockup scenario and compare that against the interrupt count when it warns about the lockup later on. The number of interrupts in that period give a hint whether the lockup might have been caused by an interrupt storm. Instead of having extra storage in the lockup detector and accessing the internals of the interrupt descriptor directly, add a snapshot member to the per CPU irq_desc::kstat_irq structure and provide interfaces to take a snapshot of all interrupts on the current CPU and to retrieve the delta of a specific interrupt later on. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240411074134.30922-3-yaoma@linux.alibaba.com
2024-04-12genirq: Convert kstat_irqs to a structBitao Hu
The irq_desc::kstat_irqs member is a per-CPU variable of type int, which is only capable of counting. A snapshot mechanism for interrupt statistics will be added soon, which requires an additional variable to store the snapshot. To facilitate expansion, convert kstat_irqs here to a struct containing only the count. Originally-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240411074134.30922-2-yaoma@linux.alibaba.com
2024-04-12perf/bpf: Change the !CONFIG_BPF_SYSCALL stubs to static inlinesIngo Molnar
Otherwise the compiler will be unhappy if they go unused, which they do on allnoconfigs. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Kyle Huey <me@kylehuey.com> Link: https://lore.kernel.org/r/ZhkE9F4dyfR2dH2D@gmail.com
2024-04-12perf/bpf: Allow a BPF program to suppress all sample side effectsKyle Huey
Returning zero from a BPF program attached to a perf event already suppresses any data output. Return early from __perf_event_overflow() in this case so it will also suppress event_limit accounting, SIGTRAP generation, and F_ASYNC signalling. Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240412015019.7060-7-khuey@kylehuey.com
2024-04-12perf/bpf: Call BPF handler directly, not through overflow machineryKyle Huey
To ultimately allow BPF programs attached to perf events to completely suppress all of the effects of a perf event overflow (rather than just the sample output, as they do today), call bpf_overflow_handler() from __perf_event_overflow() directly rather than modifying struct perf_event's overflow_handler. Return the BPF program's return value from bpf_overflow_handler() so that __perf_event_overflow() knows how to proceed. Remove the now unnecessary orig_overflow_handler from struct perf_event. This patch is solely a refactoring and results in no behavior change. Suggested-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240412015019.7060-5-khuey@kylehuey.com
2024-04-12perf/bpf: Create bpf_overflow_handler() stub for !CONFIG_BPF_SYSCALLKyle Huey
This will allow __perf_event_overflow() (which is independent of CONFIG_BPF_SYSCALL) to call bpf_overflow_handler(). Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240412015019.7060-3-khuey@kylehuey.com
2024-04-12perf/bpf: Reorder bpf_overflow_handler() ahead of __perf_event_overflow()Kyle Huey
This will allow __perf_event_overflow() to call bpf_overflow_handler(). Signed-off-by: Kyle Huey <khuey@kylehuey.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240412015019.7060-2-khuey@kylehuey.com
2024-04-12locking/pvqspinlock: Use try_cmpxchg() in qspinlock_paravirt.hUros Bizjak
Use try_cmpxchg(*ptr, &old, new) instead of cmpxchg(*ptr, old, new) == old in qspinlock_paravirt.h x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240411192317.25432-2-ubizjak@gmail.com
2024-04-12locking/pvqspinlock: Use try_cmpxchg_acquire() in trylock_clear_pending()Uros Bizjak
Replace this pattern in trylock_clear_pending(): cmpxchg_acquire(*ptr, old, new) == old ... with the simpler and faster: try_cmpxchg_acquire(*ptr, &old, new) The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after the CMPXCHG. Also change the return type of the function to bool and streamline the control flow in the _Q_PENDING_BITS == 8 variant a bit. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240325140943.815051-1-ubizjak@gmail.com
2024-04-12ftrace: Choose RCU Tasks based on TASKS_RCU rather than PREEMPTIONPaul E. McKenney
The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY might see the occasional preemption, and that this preemption just might happen within a trampoline. Therefore, update ftrace_shutdown() to invoke synchronize_rcu_tasks() based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION. [ paulmck: Apply Steven Rostedt feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <linux-trace-kernel@vger.kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-12bpf: Choose RCU Tasks based on TASKS_RCU rather than PREEMPTIONPaul E. McKenney
The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY might see the occasional preemption, and that this preemption just might happen within a trampoline. Therefore, update bpf_tramp_image_put() to choose call_rcu_tasks() based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION. This change might enable further simplifications, but the goal of this effort is to make the code safe, not necessarily optimal. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Song Liu <song@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: KP Singh <kpsingh@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <bpf@vger.kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-12mm: replace set_pte_at_notify() with just set_pte_at()Paolo Bonzini
With the demise of the .change_pte() MMU notifier callback, there is no notification happening in set_pte_at_notify(). It is a synonym of set_pte_at() and can be replaced with it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12padata: Disable BH when taking works lock on MT pathHerbert Xu
As the old padata code can execute in softirq context, disable softirqs for the new padata_do_mutithreaded code too as otherwise lockdep will get antsy. Reported-by: syzbot+0cb5bb0f4bf9e79db3b3@syzkaller.appspotmail.com Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-11ring-buffer: Only update pages_touched when a new page is touchedSteven Rostedt (Google)
The "buffer_percent" logic that is used by the ring buffer splice code to only wake up the tasks when there's no data after the buffer is filled to the percentage of the "buffer_percent" file is dependent on three variables that determine the amount of data that is in the ring buffer: 1) pages_read - incremented whenever a new sub-buffer is consumed 2) pages_lost - incremented every time a writer overwrites a sub-buffer 3) pages_touched - incremented when a write goes to a new sub-buffer The percentage is the calculation of: (pages_touched - (pages_lost + pages_read)) / nr_pages Basically, the amount of data is the total number of sub-bufs that have been touched, minus the number of sub-bufs lost and sub-bufs consumed. This is divided by the total count to give the buffer percentage. When the percentage is greater than the value in the "buffer_percent" file, it wakes up splice readers waiting for that amount. It was observed that over time, the amount read from the splice was constantly decreasing the longer the trace was running. That is, if one asked for 60%, it would read over 60% when it first starts tracing, but then it would be woken up at under 60% and would slowly decrease the amount of data read after being woken up, where the amount becomes much less than the buffer percent. This was due to an accounting of the pages_touched incrementation. This value is incremented whenever a writer transfers to a new sub-buffer. But the place where it was incremented was incorrect. If a writer overflowed the current sub-buffer it would go to the next one. If it gets preempted by an interrupt at that time, and the interrupt performs a trace, it too will end up going to the next sub-buffer. But only one should increment the counter. Unfortunately, that was not the case. Change the cmpxchg() that does the real switch of the tail-page into a try_cmpxchg(), and on success, perform the increment of pages_touched. This will only increment the counter once for when the writer moves to a new sub-buffer, and not when there's a race and is incremented for when a writer and its preempting writer both move to the same new sub-buffer. Link: https://lore.kernel.org/linux-trace-kernel/20240409151309.0d0e5056@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-04-11tracing: hide unused ftrace_event_id_fopsArnd Bergmann
When CONFIG_PERF_EVENTS, a 'make W=1' build produces a warning about the unused ftrace_event_id_fops variable: kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=] 2155 | static const struct file_operations ftrace_event_id_fops = { Hide this in the same #ifdef as the reference to it. Link: https://lore.kernel.org/linux-trace-kernel/20240403080702.3509288-7-arnd@kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Cc: Kees Cook <keescook@chromium.org> Cc: Ajay Kaher <akaher@vmware.com> Cc: Jinjie Ruan <ruanjinjie@huawei.com> Cc: Clément Léger <cleger@rivosinc.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com> Fixes: 620a30e97feb ("tracing: Don't pass file_operations array to event_create_dir()") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-04-11tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entryPrasad Pandit
Fix FTRACE_RECORD_RECURSION_SIZE entry, replace tab with a space character. It helps Kconfig parsers to read file without error. Link: https://lore.kernel.org/linux-trace-kernel/20240322121801.1803948-1-ppandit@redhat.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: 773c16705058 ("ftrace: Add recording of functions that caused recursion") Signed-off-by: Prasad Pandit <pjp@fedoraproject.org> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-04-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/unix/garbage.c 47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()") 4090fa373f0e ("af_unix: Replace garbage collection algorithm.") Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c faa12ca24558 ("bnxt_en: Reset PTP tx_avail after possible firmware reset") b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()") drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c 7ac10c7d728d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()") 194fad5b2781 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions") drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 958f56e48385 ("net/mlx5e: Un-expose functions in en.h") 49e6c9387051 ("net/mlx5e: RSS, Block XOR hash with over 128 channels") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-11Merge tag 'pm-6.9-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fix from Rafael Wysocki: "Fix the suspend-to-idle core code to guarantee that timers queued on CPUs other than the one that has first left the idle state, which should expire directly after resume, will be handled (Anna-Maria Behnsen)" * tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM: s2idle: Make sure CPUs will wakeup directly on resume
2024-04-11locking/mutex: Introduce devm_mutex_init()George Stark
Using of devm API leads to a certain order of releasing resources. So all dependent resources which are not devm-wrapped should be deleted with respect to devm-release order. Mutex is one of such objects that often is bound to other resources and has no own devm wrapping. Since mutex_destroy() actually does nothing in non-debug builds frequently calling mutex_destroy() is just ignored which is safe for now but wrong formally and can lead to a problem if mutex_destroy() will be extended so introduce devm_mutex_init(). Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: George Stark <gnstark@salutedevices.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Marek Behún <kabel@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://lore.kernel.org/r/20240411161032.609544-2-gnstark@salutedevices.com Signed-off-by: Lee Jones <lee@kernel.org>
2024-04-11tracing: Select new NEED_TASKS_RCU Kconfig optionPaul E. McKenney
Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does "select TASKS_RCU if PREEMPTION". This works, but requires any change in this enablement logic to be replicated across all such "select" clauses. A new NEED_TASKS_RCU Kconfig option has been created to allow this enablement logic to be in one place in kernel/rcu/Kconfig. Therefore, select the new NEED_TASKS_RCU Kconfig option instead of the old TASKS_RCU option. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: <linux-trace-kernel@vger.kernel.org> Cc: Ankur Arora <ankur.a.arora@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-11rcu: Add data structures for synchronize_rcu()Uladzislau Rezki (Sony)
The synchronize_rcu() call is going to be reworked, thus this patch adds dedicated fields into the rcu_state structure. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2024-04-11treewide: Use sysfs_bin_attr_simple_read() helperLukas Wunner
Deduplicate ->read() callbacks of bin_attributes which are backed by a simple buffer in memory: Use the newly introduced sysfs_bin_attr_simple_read() helper instead, either by referencing it directly or by declaring such bin_attributes with BIN_ATTR_SIMPLE_RO() or BIN_ATTR_SIMPLE_ADMIN_RO(). Aside from a reduction of LoC, this shaves off a few bytes from vmlinux (304 bytes on an x86_64 allyesconfig). No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Acked-by: Zhi Wang <zhiwang@kernel.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/92ee0a0e83a5a3f3474845db6c8575297698933a.1712410202.git.lukas@wunner.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-11locking/qspinlock: Use atomic_try_cmpxchg_relaxed() in xchg_tail()Uros Bizjak
Use atomic_try_cmpxchg_relaxed(*ptr, &old, new) instead of atomic_cmpxchg_relaxed (*ptr, old, new) == old in xchg_tail(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after CMPXCHG. No functional change intended. Since this code requires NR_CPUS >= 16k, I have tested it by unconditionally setting _Q_PENDING_BITS to 1 in <asm-generic/qspinlock_types.h>. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240321195309.484275-1-ubizjak@gmail.com
2024-04-11printk: Add function to replay kernel log on consolesSreenath Vijayan
Add a generic function console_replay_all() for replaying the kernel log on consoles, in any context. It would allow viewing the logs on an unresponsive terminal via sysrq. Reuse the existing code from console_flush_on_panic() for resetting the sequence numbers, by introducing a new helper function __console_rewind_all(). It is safe to be called under console_lock(). Try to acquire lock on the console subsystem without waiting. If successful, reset the sequence number to oldest available record on all consoles and call console_unlock() which will automatically flush the messages to the consoles. Suggested-by: John Ogness <john.ogness@linutronix.de> Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Shimoyashiki Taichi <taichi.shimoyashiki@sony.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Sreenath Vijayan <sreenath.vijayan@sony.com> Link: https://lore.kernel.org/r/90ee131c643a5033d117b556c0792de65129d4c3.1710220326.git.sreenath.vijayan@sony.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-10bpf: Add bpf_link support for sk_msg and sk_skb progsYonghong Song
Add bpf_link support for sk_msg and sk_skb programs. We have an internal request to support bpf_link for sk_msg programs so user space can have a uniform handling with bpf_link based libbpf APIs. Using bpf_link based libbpf API also has a benefit which makes system robust by decoupling prog life cycle and attachment life cycle. Reviewed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240410043527.3737160-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-04-10kprobes: Fix possible use-after-free issue on kprobe registrationZheng Yejian
When unloading a module, its state is changing MODULE_STATE_LIVE -> MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change will take a time. `is_module_text_address()` and `__module_text_address()` works with MODULE_STATE_LIVE and MODULE_STATE_GOING. If we use `is_module_text_address()` and `__module_text_address()` separately, there is a chance that the first one is succeeded but the next one is failed because module->state becomes MODULE_STATE_UNFORMED between those operations. In `check_kprobe_address_safe()`, if the second `__module_text_address()` is failed, that is ignored because it expected a kernel_text address. But it may have failed simply because module->state has been changed to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify non-exist module text address (use-after-free). To fix this problem, we should not use separated `is_module_text_address()` and `__module_text_address()`, but use only `__module_text_address()` once and do `try_module_get(module)` which is only available with MODULE_STATE_LIVE. Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/ Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas") Cc: stable@vger.kernel.org Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>