Age | Commit message | Author |
|
Because the Tasks RCU ->rtp_exit_list is initialized at rcu_init()
time while there is only one CPU running with interrupts disabled, it
is not possible for an exiting task to encounter an uninitialized list.
This commit therefore replaces the conditional initialization with
a WARN_ON_ONCE().
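A sketch of the shape of the change in the exit path (assuming the
per-CPU pointer is named rtpcp; surrounding code elided):

    /* Before: defensively initialize at task-exit time. */
    if (!rtpcp->rtp_exit_list.next)
            INIT_LIST_HEAD(&rtpcp->rtp_exit_list);

    /* After: rcu_init() already did this, so just complain if not. */
    WARN_ON_ONCE(!rtpcp->rtp_exit_list.next);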
Reported-by: Frederic Weisbecker <frederic@kernel.org>
Closes: https://lore.kernel.org/all/ZdiNXmO3wRvmzPsr@lothringen/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Tasks Trace RCU needs a single-byte cmpxchg(), but no such thing exists.
Therefore, rcu_trc_cmpxchg_need_qs() emulates one using field substitution
and a four-byte cmpxchg(), such that the other three bytes are always
atomically updated to their old values. This works, but results in
false-positive KCSAN failures because as far as KCSAN knows, this
cmpxchg() operation is updating all four bytes.
This commit therefore encloses the cmpxchg() in a data_race() and adds
a single-byte instrument_atomic_read_write(), thus telling KCSAN exactly
what is going on so as to avoid the false positives.
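The resulting pattern looks roughly like this (a sketch, assuming a
union that overlays the one-byte ->b.need_qs field onto the four-byte
->s word; the actual code differs in detail):

    /* Tell KCSAN that only the single byte is really updated... */
    instrument_atomic_read_write(&t->trc_reader_special.b.need_qs,
                                 sizeof(t->trc_reader_special.b.need_qs));
    /* ...then hide the four-byte emulation from it. */
    ret.s = data_race(cmpxchg(&t->trc_reader_special.s,
                              trs_old.s, trs_new.s));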
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Currently, there are rcu_data structure fields named ->rcu_onl_gp_flags
and ->rcu_ofl_gp_flags that track the rcu_state.gp_flags field at the
time of the corresponding CPU's last online or offline operation,
respectively. However, this information is not particularly useful.
It would be better to instead track the grace period state kept
in rcu_state.gp_state. This would also be consistent with the
initialization in rcu_boot_init_percpu_data(), which is to RCU_GP_CLEANED
(an rcu_state.gp_state value), and also with the diagnostics in
rcu_implicit_dynticks_qs(), whose format is consistent with an integer,
not a bitmask.
This commit therefore makes this change and renames the fields to
->rcu_onl_gp_state and ->rcu_ofl_gp_state, respectively.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The rcu_state.n_online_cpus value is only ever updated by CPU-hotplug
operations, which are serialized. However, this value is read locklessly.
This commit therefore marks those reads. While in the area, it also
adds ASSERT_EXCLUSIVE_WRITER() calls just in case parallel CPU hotplug
becomes a thing.
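A sketch of the resulting pattern (illustrative, not the exact hunks):

    /* Lockless reader: */
    nr = READ_ONCE(rcu_state.n_online_cpus);

    /* Serialized CPU-hotplug updater: */
    ASSERT_EXCLUSIVE_WRITER(rcu_state.n_online_cpus);
    WRITE_ONCE(rcu_state.n_online_cpus, rcu_state.n_online_cpus + 1);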
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The rcu_sync structure's ->gp_count field is updated under the protection
of ->rss_lock, but read locklessly, and KCSAN noted the data race.
This commit therefore uses WRITE_ONCE() to do this update to clearly
document its racy nature.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
This commit adds READ_ONCE() to a lockless diagnostic read from
rcu_state.gp_flags to avoid giving the compiler any chance whatsoever
of confusing the diagnostic state printed.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Although it is functionally OK to do READ_ONCE() of a variable that
cannot change, it is confusing and at best an accident waiting to happen.
This commit therefore removes a number of READ_ONCE(rcu_state.gp_flags)
instances from kernel/rcu/tree.c that are not needed due to updates
to this field being excluded by virtue of holding the root rcu_node
structure's ->lock.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/lkml/4857c5ef-bd8f-4670-87ac-0600a1699d05@paulmck-laptop/T/#mccb23c2a4902da4d3c750165329f8de056903c58
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/lkml/4857c5ef-bd8f-4670-87ac-0600a1699d05@paulmck-laptop/T/#md1b5c026584f9c3c7b0fbc9240dd7de584597b73
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Because Tiny RCU is used only in kernels built with either
CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, there has not been
any need for Tiny RCU to explicitly disable preemption. However, the
prospect of lazy preemption changes that, and preemption means that
the non-atomic increment in synchronize_rcu() can be preempted, with
the possibility that one of the increments is lost. This could cause
failures for users of the APIs that poll RCU grace periods.
This commit therefore adds the needed preempt_disable() and
preempt_enable() calls to Tiny RCU.
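A sketch of the resulting critical section in synchronize_rcu()
(assuming the counter is rcu_ctrlblk.gp_seq; details may differ):

    preempt_disable();
    /* Preemption here could otherwise lose one of two concurrent
     * increments, confusing the polled grace-period APIs. */
    WRITE_ONCE(rcu_ctrlblk.gp_seq, rcu_ctrlblk.gp_seq + 2);
    preempt_enable();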
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The TINY_RCU rcu_process_callbacks() function is only ever invoked from
a softirq handler, which means that BH is already disabled. This commit
therefore removes the redundant local_bh_disable() and local_bh_enable()
from this function.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does
"select TASKS_RCU if PREEMPTION". This works, but requires any change in
this enablement logic to be replicated across all such "select" clauses.
This commit therefore creates a new NEED_TASKS_RCU Kconfig option so
that the default value of TASKS_RCU can be computed from this new option
combined with the needed enablement logic, which then lives in just
one place.
While in the area, also anticipate a likely future change by adding
PREEMPT_AUTO to that logic.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Because Tiny SRCU is used only in kernels built with either
CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y, there has not
been any need for Tiny SRCU to explicitly disable preemption. However,
the prospect of lazy preemption changes that, and the lazy-preemption
patches do result in rcutorture runs finding both too-short grace periods
and grace-period hangs for Tiny SRCU.
This commit therefore adds the needed preempt_disable() and
preempt_enable() calls to Tiny SRCU.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Right now, TINY_RCU depends on (!PREEMPTION && !SMP), which has served the
kernel well for many years because PREEMPT_RCU is normally
a synonym for PREEMPTION. But with the advent of lazy preemption,
it will be possible to have non-preemptible RCU in a preemptible kernel,
so that kernels could be built with PREEMPT_RCU=n and PREEMPTION=y.
This commit therefore makes TINY_RCU depend on (!PREEMPT_RCU && !SMP),
thus allowing for a non-preemptible RCU in preemptible kernels.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
perf_event_set_bpf_handler() and perf_event_free_bpf_handler() can be
unused with certain Kconfig variations:
kernel/events/core.c:9622:13: warning: ‘perf_event_free_bpf_handler’ defined but not used [-Wunused-function]
kernel/events/core.c:9586:12: warning: ‘perf_event_set_bpf_handler’ defined but not used [-Wunused-function]
Since they are both single-use, mark them inline.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: Kyle Huey <khuey@kylehuey.com>
|
|
perf_output_wakeup() already marks the perf event fd available for polling.
Trigger IO signals with FASYNC too.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240413141618.4160-3-khuey@kylehuey.com
|
|
This will allow it to be called from perf_output_wakeup().
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240413141618.4160-2-khuey@kylehuey.com
|
|
Pick up perf/urgent fixes that are upstream already, but not
yet in the perf/core development branch.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Follow-up fixes for the BHI mitigations code
- Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
expected
- Work around an APIC emulation bug when the kernel is built with Clang
and run as a SEV guest
- Follow-up x86 topology fixes
* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Move TOPOEXT enablement into the topology parser
x86/cpu/amd: Make the NODEID_MSR union actually work
x86/cpu/amd: Make the CPUID 0x80000008 parser correct
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
x86/bugs: Fix BHI handling of RRSBA
x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
x86/bugs: Fix BHI documentation
x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
x86/bugs: Fix return type of spectre_bhi_state()
x86/apic: Force native_apic_mem_read() to use the MOV instruction
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
- Address a (valid) W=1 build warning
- Fix timer self-tests
- Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu
global variable
- Address a !CONFIG_BUG build warning
* tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests: kselftest: Fix build failure with NOLIBC
selftests: timers: Fix abs() warning in posix_timers test
selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
selftests: timers: Fix posix_timers ksft_print_msg() warning
selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior
bug: Fix no-return-statement warning with !CONFIG_BUG
timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu
selftests/timers/posix_timers: Reimplement check_timer_distribution()
irqflags: Explicitly ignore lockdep_hrtimer_exit() argument
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix up swiotlb buffer padding even more (Petr Tesarik)
- fix for partial dma_sync on swiotlb (Michael Kelley)
- swiotlb debugfs fix (Dexuan Cui)
* tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files()
swiotlb: fix swiotlb_bounce() to do partial sync's correctly
swiotlb: extend buffer pre-padding to alloc_align_mask if necessary
|
|
Long ago a map file descriptor in a pseudo ldimm64 instruction could
only be present as an immediate value insn[0].imm, and thus this value
was used in a verbose verifier message printed when the file descriptor
wasn't valid. Since the addition of BPF_PSEUDO_MAP_IDX_VALUE/BPF_PSEUDO_MAP_IDX
the insn[0].imm field can also contain an index pointing to the file
descriptor in the attr.fd_array array. However, if the file descriptor
is invalid, the verifier still prints the verbose message containing
the value of insn[0].imm. Patch the verifier message to always print the
actual file descriptor value.
Fixes: 387544bfa291 ("bpf: Introduce fd_idx")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240412141100.3562942-1-aspsk@isovalent.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix the buffer_percent accounting as it is dependent on three
variables:
1) pages_read - number of subbuffers read
2) pages_lost - number of subbuffers lost due to overwrite
3) pages_touched - number of pages that a writer entered
These three counters only increment, and to know how many active
pages there are on the buffer at any given time, the pages_read and
pages_lost are subtracted from pages_touched.
But pages_touched was incremented whenever any writer went to the
next subbuffer, even if it wasn't the only one doing so, so it was
incremented more often than it should be. This made the count of how
many subbuffers currently have content incorrect, which caused the
buffer_percent mechanism (which holds waiters until the ring buffer
is filled to a given percentage) to wake them up early.
- Fix warning of unused functions when PERF_EVENTS is not configured in
- Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE
- Fix some kernel-doc function comments in the eventfs code
* tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Only update pages_touched when a new page is touched
tracing: hide unused ftrace_event_id_fops
tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry
eventfs: Fix kernel-doc comments to functions
|
|
When the watchdog determines that the current soft lockup is due to an
interrupt storm based on CPU utilization, reporting the most frequent
interrupts could be good enough for further troubleshooting.
Below is an example of an interrupt storm. The call tree does not provide
useful information, but comparing the counts of interrupts during the
lockup period makes it possible to identify the culprit.
[ 638.870231] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [swapper/9:0]
[ 638.870825] CPU#9 Utilization every 4s during lockup:
[ 638.871194] #1: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.871652] #2: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.872107] #3: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.872563] #4: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.873018] #5: 0% system, 0% softirq, 100% hardirq, 0% idle
[ 638.873494] CPU#9 Detect HardIRQ Time exceeds 50%. Most frequent HardIRQs:
[ 638.873994] #1: 330945 irq#7
[ 638.874236] #2: 31 irq#82
[ 638.874493] #3: 10 irq#10
[ 638.874744] #4: 2 irq#89
[ 638.874992] #5: 1 irq#102
...
[ 638.875313] Call trace:
[ 638.875315] __do_softirq+0xa8/0x364
Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240411074134.30922-6-yaoma@linux.alibaba.com
|
|
The following softlockup is caused by an interrupt storm, but it cannot be
identified from the call tree, because the call tree is just a snapshot
that doesn't fully capture the behavior of the CPU during the soft lockup.
watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
...
Call trace:
__do_softirq+0xa0/0x37c
__irq_exit_rcu+0x108/0x140
irq_exit+0x14/0x20
__handle_domain_irq+0x84/0xe0
gic_handle_irq+0x80/0x108
el0_irq_naked+0x50/0x58
Therefore, it is necessary to report CPU utilization during the
softlockup_threshold period (reported once every sample_period, for a
total of five reports), like this:
watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
CPU#28 Utilization every 4s during lockup:
#1: 0% system, 0% softirq, 100% hardirq, 0% idle
#2: 0% system, 0% softirq, 100% hardirq, 0% idle
#3: 0% system, 0% softirq, 100% hardirq, 0% idle
#4: 0% system, 0% softirq, 100% hardirq, 0% idle
#5: 0% system, 0% softirq, 100% hardirq, 0% idle
...
This is helpful in determining whether an interrupt storm has occurred or
in identifying the cause of the softlockup. The criteria for determination
are as follows:
a. If the hardirq utilization is high, then interrupt storm should be
considered and the root cause cannot be determined from the call tree.
b. If the softirq utilization is high, then the call tree might not
necessarily point at the root cause.
c. If the system utilization is high, then analyzing the root
cause from the call tree is possible in most cases.
The mechanism requires a considerable amount of global storage space
when configured for the maximum number of CPUs. Therefore, add a
SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes"
if the maximum number of CPUs is <= 128.
Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240411074134.30922-5-yaoma@linux.alibaba.com
|
|
show_interrupts() unconditionally accumulates the per CPU interrupt
statistics to determine whether an interrupt was ever raised.
This can be avoided for all interrupts which are not strictly per CPU
and not of type NMI because those interrupts already provide an
accumulated counter. The required logic is already implemented in
kstat_irqs().
Split the inner access logic out of kstat_irqs() and use it for
kstat_irqs() and show_interrupts() to avoid the accumulation loop
when possible.
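A sketch of the split-out helper (assuming the helper name
kstat_irqs_desc() and the usual irqdesc predicates; details may differ):

    static unsigned int kstat_irqs_desc(struct irq_desc *desc,
                                        const struct cpumask *cpumask)
    {
            unsigned int sum = 0;
            int cpu;

            /*
             * Non-per-CPU, non-NMI interrupts already maintain an
             * accumulated count, so the per-CPU loop can be skipped.
             */
            if (!irq_settings_is_per_cpu_devid(desc) &&
                !irq_settings_is_per_cpu(desc) &&
                !irq_is_nmi(desc))
                    return data_race(desc->tot_count);

            for_each_cpu(cpu, cpumask)
                    sum += data_race(per_cpu(desc->kstat_irqs->cnt, cpu));
            return sum;
    }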
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20240411074134.30922-4-yaoma@linux.alibaba.com
|
|
The soft lockup detector lacks a mechanism to identify interrupt storms as
the root cause of a lockup. To enable this, the detector needs a mechanism to
snapshot the interrupt count statistics on a CPU when the detector observes
a potential lockup scenario and compare that against the interrupt count
when it warns about the lockup later on. The number of interrupts in that
period gives a hint whether the lockup might have been caused by an
interrupt storm.
Instead of having extra storage in the lockup detector and accessing the
internals of the interrupt descriptor directly, add a snapshot member to
the per CPU irq_desc::kstat_irq structure and provide interfaces to take a
snapshot of all interrupts on the current CPU and to retrieve the delta of
a specific interrupt later on.
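The interfaces look roughly like this (a sketch; the names are assumed
from this series, and details are hedged):

    /* Snapshot the current count of every interrupt on this CPU. */
    void kstat_snapshot_irqs(void);

    /* Number of occurrences of @irq on this CPU since the snapshot. */
    unsigned int kstat_get_irq_since_snapshot(unsigned int irq);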
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240411074134.30922-3-yaoma@linux.alibaba.com
|
|
The irq_desc::kstat_irqs member is a per-CPU variable of type int, which is
only capable of counting. A snapshot mechanism for interrupt statistics
will be added soon, which requires an additional variable to store the
snapshot.
To facilitate expansion, convert kstat_irqs here to a struct containing
only the count.
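Schematically (a sketch of the converted type):

    struct irqstat {
            unsigned int cnt;       /* the former plain per-CPU counter */
    };

so that the irq_desc member becomes "struct irqstat __percpu *kstat_irqs"
instead of "unsigned int __percpu *kstat_irqs", leaving room for a
snapshot field to be added alongside cnt.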
Originally-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240411074134.30922-2-yaoma@linux.alibaba.com
|
|
Otherwise the compiler will be unhappy if the !CONFIG_BPF_SYSCALL stubs
go unused, which they do on allnoconfigs.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Kyle Huey <me@kylehuey.com>
Link: https://lore.kernel.org/r/ZhkE9F4dyfR2dH2D@gmail.com
|
|
Returning zero from a BPF program attached to a perf event already
suppresses any data output. Return early from __perf_event_overflow() in
this case so it will also suppress event_limit accounting, SIGTRAP
generation, and F_ASYNC signalling.
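A sketch of the early return (assuming the zero-means-suppress
convention described above; surrounding code elided):

    #ifdef CONFIG_BPF_SYSCALL
            if (event->prog && !bpf_overflow_handler(event, data, regs))
                    return ret;     /* skip event_limit, SIGTRAP, FASYNC */
    #endif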
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240412015019.7060-7-khuey@kylehuey.com
|
|
To ultimately allow BPF programs attached to perf events to completely
suppress all of the effects of a perf event overflow (rather than just the
sample output, as they do today), call bpf_overflow_handler() from
__perf_event_overflow() directly rather than modifying struct perf_event's
overflow_handler. Return the BPF program's return value from
bpf_overflow_handler() so that __perf_event_overflow() knows how to
proceed. Remove the now unnecessary orig_overflow_handler from struct
perf_event.
This patch is solely a refactoring and results in no behavior change.
Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240412015019.7060-5-khuey@kylehuey.com
|
|
This will allow __perf_event_overflow() (which is independent of
CONFIG_BPF_SYSCALL) to call bpf_overflow_handler().
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240412015019.7060-3-khuey@kylehuey.com
|
|
This will allow __perf_event_overflow() to call bpf_overflow_handler().
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240412015019.7060-2-khuey@kylehuey.com
|
|
Use try_cmpxchg(*ptr, &old, new) instead of
cmpxchg(*ptr, old, new) == old in qspinlock_paravirt.h.
The x86 CMPXCHG instruction returns success in the ZF flag, so
this change saves a compare after the CMPXCHG.
No functional change intended.
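Schematically:

    /* Before: */
    if (cmpxchg(ptr, old, new) == old) {
            /* success */
    }

    /* After: try_cmpxchg() also updates 'old' on failure, which can
     * save a reload in retry loops. */
    if (try_cmpxchg(ptr, &old, new)) {
            /* success */
    }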
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240411192317.25432-2-ubizjak@gmail.com
|
|
Replace this pattern in trylock_clear_pending():
cmpxchg_acquire(*ptr, old, new) == old
... with the simpler and faster:
try_cmpxchg_acquire(*ptr, &old, new)
The x86 CMPXCHG instruction returns success in the ZF flag, so this change
saves a compare after the CMPXCHG.
Also change the return type of the function to bool and streamline
the control flow in the _Q_PENDING_BITS == 8 variant a bit.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240325140943.815051-1-ubizjak@gmail.com
|
|
The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that
even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY
might see the occasional preemption, and that this preemption just might
happen within a trampoline.
Therefore, update ftrace_shutdown() to invoke synchronize_rcu_tasks()
based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION.
[ paulmck: Apply Steven Rostedt feedback. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <linux-trace-kernel@vger.kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The advent of CONFIG_PREEMPT_AUTO, AKA lazy preemption, will mean that
even kernels built with CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY
might see the occasional preemption, and that this preemption just might
happen within a trampoline.
Therefore, update bpf_tramp_image_put() to choose call_rcu_tasks()
based on CONFIG_TASKS_RCU instead of CONFIG_PREEMPTION.
This change might enable further simplifications, but the goal of this
effort is to make the code safe, not necessarily optimal.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <bpf@vger.kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
With the demise of the .change_pte() MMU notifier callback, there is no
notification happening in set_pte_at_notify(). It is a synonym of
set_pte_at() and can be replaced with it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20240405115815.3226315-5-pbonzini@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As the old padata code can execute in softirq context, disable
softirqs for the new padata_do_multithreaded() code too, as otherwise
lockdep will get antsy.
Reported-by: syzbot+0cb5bb0f4bf9e79db3b3@syzkaller.appspotmail.com
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
The "buffer_percent" logic that is used by the ring buffer splice code to
only wake up the tasks when there's no data after the buffer is filled to
the percentage of the "buffer_percent" file is dependent on three
variables that determine the amount of data that is in the ring buffer:
1) pages_read - incremented whenever a new sub-buffer is consumed
2) pages_lost - incremented every time a writer overwrites a sub-buffer
3) pages_touched - incremented when a write goes to a new sub-buffer
The percentage is the calculation of:
(pages_touched - (pages_lost + pages_read)) / nr_pages
Basically, the amount of data is the total number of sub-bufs that have been
touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
divided by the total count to give the buffer percentage. When the
percentage is greater than the value in the "buffer_percent" file, it
wakes up splice readers waiting for that amount.
It was observed that over time, the amount read from the splice was
constantly decreasing the longer the trace was running. That is, if one
asked for 60%, it would read over 60% when it first starts tracing, but
then it would be woken up at under 60% and would slowly decrease the
amount of data read after being woken up, where the amount becomes much
less than the buffer percent.
This was due to faulty accounting of the pages_touched increment. This
value is incremented whenever a writer moves to a new sub-buffer, but
the place where it was incremented was wrong. If a writer overflowed
the current sub-buffer it would go to the next one. If it gets preempted
by an interrupt at that time, and the interrupt performs a trace, it too
will end up going to the next sub-buffer. But only one should increment
the counter. Unfortunately, that was not the case.
Change the cmpxchg() that does the real switch of the tail-page into a
try_cmpxchg(), and on success, perform the increment of pages_touched. This
will only increment the counter once for when the writer moves to a new
sub-buffer, and not when there's a race and is incremented for when a
writer and its preempting writer both move to the same new sub-buffer.
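A sketch of the fix (assuming the tail-page switch lives in
rb_tail_page_update(); exact code may differ):

    /* Only the writer that wins the race to move the tail page
     * accounts for the newly touched sub-buffer. */
    if (try_cmpxchg(&cpu_buffer->tail_page, &tail_page, next_page))
            local_inc(&cpu_buffer->pages_touched);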
Link: https://lore.kernel.org/linux-trace-kernel/20240409151309.0d0e5056@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 2c2b0a78b3739 ("ring-buffer: Add percentage of ring buffer full to wake up reader")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When CONFIG_PERF_EVENTS is not configured, a 'make W=1' build produces
a warning about the unused ftrace_event_id_fops variable:
kernel/trace/trace_events.c:2155:37: error: 'ftrace_event_id_fops' defined but not used [-Werror=unused-const-variable=]
2155 | static const struct file_operations ftrace_event_id_fops = {
Hide this in the same #ifdef as the reference to it.
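Schematically:

    #ifdef CONFIG_PERF_EVENTS
    static const struct file_operations ftrace_event_id_fops = {
            /* ... unchanged ... */
    };
    #endif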
Link: https://lore.kernel.org/linux-trace-kernel/20240403080702.3509288-7-arnd@kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Zheng Yejian <zhengyejian1@huawei.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ajay Kaher <akaher@vmware.com>
Cc: Jinjie Ruan <ruanjinjie@huawei.com>
Cc: Clément Léger <cleger@rivosinc.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Fixes: 620a30e97feb ("tracing: Don't pass file_operations array to event_create_dir()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Fix the FTRACE_RECORD_RECURSION_SIZE entry by replacing a tab with
a space character. This helps Kconfig parsers read the file
without error.
Link: https://lore.kernel.org/linux-trace-kernel/20240322121801.1803948-1-ppandit@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Fixes: 773c16705058 ("ftrace: Add recording of functions that caused recursion")
Signed-off-by: Prasad Pandit <pjp@fedoraproject.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
net/unix/garbage.c
47d8ac011fe1 ("af_unix: Fix garbage collector racing against connect()")
4090fa373f0e ("af_unix: Replace garbage collection algorithm.")
Adjacent changes:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
faa12ca24558 ("bnxt_en: Reset PTP tx_avail after possible firmware reset")
b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
7ac10c7d728d ("bnxt_en: Fix possible memory leak in bnxt_rdma_aux_device_init()")
194fad5b2781 ("bnxt_en: Refactor bnxt_rdma_aux_device_init/uninit functions")
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
958f56e48385 ("net/mlx5e: Un-expose functions in en.h")
49e6c9387051 ("net/mlx5e: RSS, Block XOR hash with over 128 channels")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix the suspend-to-idle core code to guarantee that timers queued on
CPUs other than the one that has first left the idle state, which
should expire directly after resume, will be handled (Anna-Maria
Behnsen)"
* tag 'pm-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: s2idle: Make sure CPUs will wakeup directly on resume
|
|
Use of the devm API leads to a certain order of releasing resources,
so all dependent resources which are not devm-wrapped should be deleted
with respect to the devm-release order. A mutex is one such object: it
is often bound to other resources and has no devm wrapping of its own.
Since mutex_destroy() actually does nothing in non-debug builds, the
missing mutex_destroy() calls are harmless for now, but they are formally
wrong and can become a problem if mutex_destroy() is ever extended, so
introduce devm_mutex_init().
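A hypothetical usage sketch (the device pointer and 'priv' structure
are illustrative only):

    ret = devm_mutex_init(&pdev->dev, &priv->lock);
    if (ret)
            return ret;
    /* No explicit mutex_destroy(): it runs in devm release order. */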
Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: George Stark <gnstark@salutedevices.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Reviewed-by: Marek Behún <kabel@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20240411161032.609544-2-gnstark@salutedevices.com
Signed-off-by: Lee Jones <lee@kernel.org>
|
|
Currently, if a Kconfig option depends on TASKS_RCU, it conditionally does
"select TASKS_RCU if PREEMPTION". This works, but requires any change in
this enablement logic to be replicated across all such "select" clauses.
A new NEED_TASKS_RCU Kconfig option has been created to allow this
enablement logic to be in one place in kernel/rcu/Kconfig.
Therefore, select the new NEED_TASKS_RCU Kconfig option instead of the
old TASKS_RCU option.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: <linux-trace-kernel@vger.kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
The synchronize_rcu() call is going to be reworked, so this
patch adds dedicated fields to the rcu_state structure.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
|
|
Deduplicate ->read() callbacks of bin_attributes which are backed by a
simple buffer in memory:
Use the newly introduced sysfs_bin_attr_simple_read() helper instead,
either by referencing it directly or by declaring such bin_attributes
with BIN_ATTR_SIMPLE_RO() or BIN_ATTR_SIMPLE_ADMIN_RO().
Aside from a reduction of LoC, this shaves off a few bytes from vmlinux
(304 bytes on an x86_64 allyesconfig).
No functional change intended.
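A hypothetical usage sketch (the attribute name and backing buffer are
illustrative only):

    /* Declares bin_attr_blob with .read = sysfs_bin_attr_simple_read. */
    static BIN_ATTR_SIMPLE_RO(blob);

    /* At init time, point the attribute at the in-memory buffer: */
    bin_attr_blob.private = blob_buf;
    bin_attr_blob.size = blob_len;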
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Zhi Wang <zhiwang@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/92ee0a0e83a5a3f3474845db6c8575297698933a.1712410202.git.lukas@wunner.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Use atomic_try_cmpxchg_relaxed(*ptr, &old, new) instead of
atomic_cmpxchg_relaxed(*ptr, old, new) == old in xchg_tail().
The x86 CMPXCHG instruction returns success in the ZF flag, so
this change saves a compare after the CMPXCHG.
No functional change intended.
Since this code requires NR_CPUS >= 16k, I have tested it
by unconditionally setting _Q_PENDING_BITS to 1 in
<asm-generic/qspinlock_types.h>.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240321195309.484275-1-ubizjak@gmail.com
|
|
Add a generic function console_replay_all() for replaying
the kernel log on consoles, in any context. It would allow
viewing the logs on an unresponsive terminal via sysrq.
Reuse the existing code from console_flush_on_panic() for
resetting the sequence numbers, by introducing a new helper
function __console_rewind_all(). It is safe to call under
console_lock().
Try to acquire the console subsystem lock without waiting.
If successful, reset the sequence number to the oldest available
record on all consoles and call console_unlock() which will
automatically flush the messages to the consoles.
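A sketch of the resulting function (assuming the helper named above;
details may differ):

    void console_replay_all(void)
    {
            if (console_trylock()) {
                    __console_rewind_all();
                    /* console_unlock() flushes the records out. */
                    console_unlock();
            }
    }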
Suggested-by: John Ogness <john.ogness@linutronix.de>
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Shimoyashiki Taichi <taichi.shimoyashiki@sony.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Sreenath Vijayan <sreenath.vijayan@sony.com>
Link: https://lore.kernel.org/r/90ee131c643a5033d117b556c0792de65129d4c3.1710220326.git.sreenath.vijayan@sony.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Add bpf_link support for sk_msg and sk_skb programs. We have an
internal request to support bpf_link for sk_msg programs so that user
space can handle attachments uniformly through the bpf_link-based libbpf
APIs. Using the bpf_link-based libbpf APIs also makes the system more
robust by decoupling program life cycle from attachment life cycle.
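User space can then attach via a link, for example (a sketch; the
libbpf helper name is an assumption based on the accompanying libbpf
changes):

    struct bpf_link *link;

    /* Attach an sk_msg/sk_skb program to a sockmap through a bpf_link. */
    link = bpf_program__attach_sockmap(prog, map_fd);
    if (!link)
            /* handle error */;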
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240410043527.3737160-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When unloading a module, its state changes from MODULE_STATE_LIVE ->
MODULE_STATE_GOING -> MODULE_STATE_UNFORMED, and each transition takes
some time. `is_module_text_address()` and `__module_text_address()`
work with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one succeeds but the second
one fails because module->state becomes MODULE_STATE_UNFORMED between
those operations.
In `check_kprobe_address_safe()`, if the second `__module_text_address()`
fails, that failure is ignored because the address is then assumed to be
a kernel text address. But it may have failed simply because module->state
changed to MODULE_STATE_UNFORMED, in which case arm_kprobe() will try to
modify module text that no longer exists (a use-after-free).
To fix this problem, do not use separate `is_module_text_address()` and
`__module_text_address()` calls; instead, call `__module_text_address()`
only once and then do `try_module_get(module)`, which succeeds only in
MODULE_STATE_LIVE.
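A sketch of the fixed lookup (surrounding check_kprobe_address_safe()
logic elided; details may differ):

    struct module *mod;

    /* Single lookup, plus a reference that pins MODULE_STATE_LIVE. */
    preempt_disable();
    mod = __module_text_address((unsigned long)p->addr);
    if (mod && !try_module_get(mod))
            mod = NULL;     /* module is going away; treat as not found */
    preempt_enable();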
Link: https://lore.kernel.org/all/20240410015802.265220-1-zhengyejian1@huawei.com/
Fixes: 28f6c37a2910 ("kprobes: Forbid probing on trampoline and BPF code areas")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|