Age | Commit message (Collapse) | Author |
|
A malicious BPF program may manipulate the branch history to influence
what the hardware speculates will happen next.
On exit from a BPF program, emit the BHB mititgation sequence.
This is only applied for 'classic' cBPF programs that are loaded by
seccomp.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
|
|
Add a helper to expose the k value of the branchy loop. This is needed
by the BPF JIT to generate the mitigation sequence in BPF programs.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
is_spectre_bhb_fw_affected() allows the caller to determine if the CPU
is known to need a firmware mitigation. CPUs are either on the list
of CPUs we know about, or firmware has been queried and reported that
the platform is affected - and mitigated by firmware.
This helper is not useful to determine if the platform is mitigated
by firmware. A CPU could be on the know list, but the firmware may
not be implemented. Its affected but not mitigated.
spectre_bhb_enable_mitigation() handles this distinction by checking
the firmware state before enabling the mitigation.
Add a helper to expose this state. This will be used by the BPF JIT
to determine if calling firmware for a mitigation is necessary and
supported.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
To generate code in the eBPF epilogue that uses the DSB instruction,
insn.c needs a heler to encode the type and domain.
Re-use the crm encoding logic from the DMB instruction.
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Historically SVE state was discarded deterministically early in the
syscall entry path, before ptrace is notified of syscall entry. This
permitted ptrace to modify SVE state before and after the "real" syscall
logic was executed, with the modified state being retained.
This behaviour was changed by commit:
8c845e2731041f0f ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
That commit was intended to speed up workloads that used SVE by
opportunistically leaving SVE enabled when returning from a syscall.
The syscall entry logic was modified to truncate the SVE state without
disabling userspace access to SVE, and fpsimd_save_user_state() was
modified to discard userspace SVE state whenever
in_syscall(current_pt_regs()) is true, i.e. when
current_pt_regs()->syscallno != NO_SYSCALL.
Leaving SVE enabled opportunistically resulted in a couple of changes to
userspace visible behaviour which weren't described at the time, but are
logical consequences of opportunistically leaving SVE enabled:
* Signal handlers can observe the type of saved state in the signal's
sve_context record. When the kernel only tracks FPSIMD state, the 'vq'
field is 0 and there is no space allocated for register contents. When
the kernel tracks SVE state, the 'vq' field is non-zero and the
register contents are saved into the record.
As a result of the above commit, 'vq' (and the presence of SVE
register state) is non-deterministically zero or non-zero for a period
of time after a syscall. The effective register state is still
deterministic.
Hopefully no-one relies on this being deterministic. In general,
handlers for asynchronous events cannot expect a deterministic state.
* Similarly to signal handlers, ptrace requests can observe the type of
saved state in the NT_ARM_SVE and NT_ARM_SSVE regsets, as this is
exposed in the header flags. As a result of the above commit, this is
now in a non-deterministic state after a syscall. The effective
register state is still deterministic.
Hopefully no-one relies on this being deterministic. In general,
debuggers would have to handle this changing at arbitrary points
during program flow.
Discarding the SVE state within fpsimd_save_user_state() resulted in
other changes to userspace visible behaviour which are not desirable:
* A ptrace tracer can modify (or create) a tracee's SVE state at syscall
entry or syscall exit. As a result of the above commit, the tracee's
SVE state can be discarded non-deterministically after modification,
rather than being retained as it previously was.
Note that for co-operative tracer/tracee pairs, the tracer may
(re)initialise the tracee's state arbitrarily after the tracee sends
itself an initial SIGSTOP via a syscall, so this affects realistic
design patterns.
* The current_pt_regs()->syscallno field can be modified via ptrace, and
can be altered even when the tracee is not really in a syscall,
causing non-deterministic discarding to occur in situations where this
was not previously possible.
Further, using current_pt_regs()->syscallno in this way is unsound:
* There are data races between readers and writers of the
current_pt_regs()->syscallno field.
The current_pt_regs()->syscallno field is written in interruptible
task context using plain C accesses, and is read in irq/softirq
context using plain C accesses. These accesses are subject to data
races, with the usual concerns with tearing, etc.
* Writes to current_pt_regs()->syscallno are subject to compiler
reordering.
As current_pt_regs()->syscallno is written with plain C accesses,
the compiler is free to move those writes arbitrarily relative to
anything which doesn't access the same memory location.
In theory this could break signal return, where prior to restoring the
SVE state, restore_sigframe() calls forget_syscall(). If the write
were hoisted after restore of some SVE state, that state could be
discarded unexpectedly.
In practice that reordering cannot happen in the absence of LTO (as
cross compilation-unit function calls happen prevent this reordering),
and that reordering appears to be unlikely in the presence of LTO.
Additionally, since commit:
f130ac0ae4412dbe ("arm64: syscall: unmask DAIF earlier for SVCs")
... DAIF is unmasked before el0_svc_common() sets regs->syscallno to the
real syscall number. Consequently state may be saved in SVE format prior
to this point.
Considering all of the above, current_pt_regs()->syscallno should not be
used to infer whether the SVE state can be discarded. Luckily we can
instead use cpu_fp_state::to_save to track when it is safe to discard
the SVE state:
* At syscall entry, after the live SVE register state is truncated, set
cpu_fp_state::to_save to FP_STATE_FPSIMD to indicate that only the
FPSIMD portion is live and needs to be saved.
* At syscall exit, once the task's state is guaranteed to be live, set
cpu_fp_state::to_save to FP_STATE_CURRENT to indicate that TIF_SVE
must be considered to determine which state needs to be saved.
* Whenever state is modified, it must be saved+flushed prior to
manipulation. The state will be truncated if necessary when it is
saved, and reloading the state will set fp_state::to_save to
FP_STATE_CURRENT, preventing subsequent discarding.
This permits SVE state to be discarded *only* when it is known to have
been truncated (and the non-FPSIMD portions must be zero), and ensures
that SVE state is retained after it is explicitly modified.
For backporting, note that this fix depends on the following commits:
* b2482807fbd4 ("arm64/sme: Optimise SME exit on syscall entry")
* f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
* 929fa99b1215 ("arm64/fpsimd: signal: Always save+flush state early")
Fixes: 8c845e273104 ("arm64/sve: Leave SVE enabled on syscall if we don't context switch")
Fixes: f130ac0ae441 ("arm64: syscall: unmask DAIF earlier for SVCs")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250508132644.1395904-2-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Soon, KVM is going to use this logic for hypervisor panics,
so add it in a wrapper that can be used by the hypervisor exit
handler to decode hyp panics.
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250430162713.1997569-2-smostafa@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
HCRX_HOST_FLAGS, like most of these hardcoded setups, are not
a good match for options that can be selectively enabled or
disabled.
Nothing but the early setup is relying on it now, so kill the
macro and move the bag of bits where they belong.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250430105916.3815157-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The nVHE hypervisor needs to have access to its own view of the FGT
masks, which unfortunately results in a bit of data duplication.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In the process of decoupling KVM's view of the FGT bits from the
wider architectural state, use KVM's own FGT tables to build
a synthetic view of what is actually known.
This allows for some checking along the way.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Provide the architected EC and ISS values for all the FEAT_LS64*
instructions.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
A bunch of sysregs are now generated from the sysreg file, so no
need to carry separate definitions.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add the new CMOs trapped by HFGITR2_EL2.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Bulk addition of all the system registers trapped by HDFG{R,W}TR2_EL2.
The descriptions are extracted from the BSD-licenced JSON file part
of the 2025-03 drop from ARM.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Treating HFGRTR_EL2 and HFGWTR_EL2 identically was a mistake.
It makes things hard to reason about, has the potential to
introduce bugs by giving a meaning to bits that are really reserved,
and is in general a bad description of the architecture.
Given that #defines are cheap, let's describe both registers as
intended by the architecture, and repaint all the existing uses.
Yes, this is painful.
The registers themselves are generated from the JSON file in
an automated way.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add HCR_EL2 to the sysreg file, more or less directly generated
from the JSON file.
Since the generated names significantly differ from the existing
naming, express the old names in terms of the new one. One day, we'll
fix this mess, but I'm not in any hurry.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The pKVM selftest intends to test as many memory 'transitions' as
possible, so extend it to cover sharing pages with non-protected guests,
including in the case of multi-sharing.
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416160900.3078417-5-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The hypervisor has not needed its own .data section because all globals
were either .rodata or .bss. To avoid having to initialize future
data-structures at run-time, let's introduce add a .data section to the
hypervisor.
Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416160900.3078417-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We keep setting and clearing these bits depending on the role of
the host kernel, mimicking what we do for nVHE. But that's actually
pretty pointless, as we always want physical interrupts to make it
to the host, at EL2.
This has also two problems:
- it prevents IRQs from being taken when these bits are cleared
if the implementation has chosen to implement these bits as
masks when HCR_EL2.{TGE,xMO}=={0,0}
- it triggers a bad erratum on the AmpereOne HW, which catches
fire on clearing these bits while an interrupt is being taken
(AC03_CPU_36).
Let's kill these two birds with a single stone, and permanently
set the xMO bits when running VHE. This involves a bit of surgery
on code paths that rely on flipping these bits on and off for
other purposes.
Note that the earliest setting of hcr_el2 (in the init_hcr_el2
macro) is left untouched as is runs extremely early, with interrupts
disabled, and soon enough overwritten with the final value containing
the xMO bits.
Reported-by: D Scott Phillips <scott@os.amperecomputing.com>
Link: https://lore.kernel.org/r/20250429114326.3618875-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We keep setting and clearing these bits depending on the role of
the host kernel, mimicking what we do for nVHE. But that's actually
pretty pointless, as we always want physical interrupts to make it
to the host, at EL2.
This has also two problems:
- it prevents IRQs from being taken when these bits are cleared
if the implementation has chosen to implement these bits as
masks when HCR_EL2.{TGE,xMO}=={0,0}
- it triggers a bad erratum on the AmpereOne HW, which catches
fire on clearing these bits while an interrupt is being taken
(AC03_CPU_36).
Let's kill these two birds with a single stone, and permanently
set the xMO bits when running VHE. This involves a bit of surgery
on code paths that rely on flipping these bits on and off for
other purposes.
Note that the earliest setting of hcr_el2 (in the init_hcr_el2
macro) is left untouched as is runs extremely early, with interrupts
disabled, and soon enough overwritten with the final value containing
the xMO bits.
Reported-by: D Scott Phillips <scott@os.amperecomputing.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250429114326.3618875-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
All vDSO code needs to be completely position independent. Symbol
references are marked as hidden so the compiler emits PC-relative
relocations.
However GCC emits absolute relocations for symbol-relative references with
an offset >= 64KiB. After recent refactorings in the vDSO code this is the
case in __arch_get_vdso_u_timens_data() with a page size of 64KiB.
Work around the issue by preventing the optimizer from seeing the offsets.
Fixes: 83a2a6b8cfc5 ("vdso/gettimeofday: Prepare do_hres_timens() for introduction of struct vdso_clock")
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/all/20250430-vdso-absolute-reloc-v2-1-5efcc3bc4b26@linutronix.de
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120002
Closes: https://lore.kernel.org/lkml/aApGPAoctq_eoE2g@t14ultra/
|
|
Now that gcc-8 and binutils-2.30 are the minimum versions, a lot of
the individual feature checks can go away for simplification.
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Doing:
#include <linux/mem_encrypt.h>
Causes a bunch of compiler failures due to missing implicit includes that
don't happen on x86:
../arch/arm64/include/asm/rsi_cmds.h:117:2: error: call to undeclared library function 'memcpy' with type 'void *(void *, const void *, unsigned long)'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
117 | memcpy(®s.a1, challenge, size);
../arch/arm64/include/asm/mem_encrypt.h:19:49: warning: declaration of 'struct device' will not be visible outside of this function [-Wvisibility]
19 | static inline bool force_dma_unencrypted(struct device *dev)
../arch/arm64/include/asm/rsi_cmds.h:44:38: error: call to undeclared function 'virt_to_phys'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
44 | arm_smccc_smc(SMC_RSI_REALM_CONFIG, virt_to_phys(cfg),
Add the missing includes to the arch/arm headers to avoid this.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/0-v1-47aadfbd64cd+25795-arm_memenc_h_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
The KVM PV ABI recently added a feature that allows the VM to discover
the set of physical CPU implementations, identified by a tuple of
{MIDR_EL1, REVIDR_EL1, AIDR_EL1}. Unlike other KVM PV features, the
expectation is that the VMM implements the hypercall instead of KVM as
it has the authoritative view of where the VM gets scheduled.
To do this the VMM needs to know the values of these registers on any
CPU in the system. While MIDR_EL1 and REVIDR_EL1 are already exposed,
AIDR_EL1 is not. Provide it in sysfs along with the other identification
registers.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250403231626.3181116-1-oliver.upton@linux.dev
Signed-off-by: Will Deacon <will@kernel.org>
|
|
While reading how `cntvct_el0` was read in the kernel, I found that
__arch_get_hw_counter() is doing something very similar to what
__arch_counter_get_cntvct() is already doing.
Use the existing __arch_counter_get_cntvct() function instead of
duplicating similar inline assembly code in __arch_get_hw_counter().
Both functions were performing nearly identical operations to read the
cntvct_el0 register. The only difference was that
__arch_get_hw_counter() included a memory clobber in its inline
assembly, which appears unnecessary in this context.
This change simplifies the code by eliminating duplicate functionality
and improves maintainability by centralizing the counter access logic in
a single implementation.
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20250407-arm-vdso-v1-1-7012de25b195@debian.org
Signed-off-by: Will Deacon <will@kernel.org>
|
|
For an architecture to enable CONFIG_ARCH_HAS_RESCHED_LAZY, two things are
required:
1) Adding a TIF_NEED_RESCHED_LAZY flag definition
2) Checking for TIF_NEED_RESCHED_LAZY in the appropriate locations
2) is handled in a generic manner by CONFIG_GENERIC_ENTRY, which isn't
(yet) implemented for arm64. However, outside of core scheduler code,
TIF_NEED_RESCHED_LAZY only needs to be checked on a kernel exit, meaning:
o return/entry to userspace.
o return/entry to guest.
The return/entry to a guest is all handled by xfer_to_guest_mode_handle_work()
which already does the right thing, so it can be left as-is.
arm64 doesn't use common entry's exit_to_user_mode_prepare(), so update its
return to user path to check for TIF_NEED_RESCHED_LAZY and call into
schedule() accordingly.
Link: https://lore.kernel.org/linux-rt-users/20241216190451.1c61977c@mordecai.tesarici.cz/
Link: https://lore.kernel.org/all/xhsmh4j0fl0p3.mognet@vschneid-thinkpadt14sgen2i.remote.csb/
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[testdrive, _TIF_WORK_MASK fixlet and changelog.]
Signed-off-by: Mike Galbraith <efault@gmx.de>
[Another round of testing; changelog faff]
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20250305104925.189198-2-vschneid@redhat.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
* kvm-arm64/nv-pmu-fixes:
: .
: Fixes for NV PMU emulation. From the cover letter:
:
: "Joey reports that some of his PMU tests do not behave quite as
: expected:
:
: - MDCR_EL2.HPMN is set to 0 out of reset
:
: - PMCR_EL0.P should reset all the counters when written from EL2
:
: Oliver points out that setting PMCR_EL0.N from userspace by writing to
: the register is silly with NV, and that we need a new PMU attribute
: instead.
:
: On top of that, I figured out that we had a number of little gotchas:
:
: - It is possible for a guest to write an HPMN value that is out of
: bound, and it seems valuable to limit it
:
: - PMCR_EL0.N should be the maximum number of counters when read from
: EL2, and MDCR_EL2.HPMN when read from EL0/EL1
:
: - Prevent userspace from updating PMCR_EL0.N when EL2 is available"
: .
KVM: arm64: Let kvm_vcpu_read_pmcr() return an EL-dependent value for PMCR_EL0.N
KVM: arm64: Handle out-of-bound write to MDCR_EL2.HPMN
KVM: arm64: Don't let userspace write to PMCR_EL0.N when the vcpu has EL2
KVM: arm64: Allow userspace to limit the number of PMU counters for EL2 VMs
KVM: arm64: Contextualise the handling of PMCR_EL0.P writes
KVM: arm64: Fix MDCR_EL2.HPMN reset value
KVM: arm64: Repaint pmcr_n into nr_pmu_counters
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
When dealing with a guest with SVE enabled, make sure the host SVE
state is pinned at EL2 S1, and that the hypervisor vCPU state is
correctly initialised (and then unpinned on teardown).
Co-authored-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Link: https://lore.kernel.org/r/20250416152648.2982950-2-qperret@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.15, round #2
- Single fix for broken usage of 'multi-MIDR' infrastructure in PI
code, adding an open-coded erratum check for everyone's favorite pile
of sand: Cavium ThunderX
|
|
kvm_arch_has_irq_bypass() is a small function and even though it does
not appear in any *really* hot paths, it's also not entirely rare.
Make it inline---it also works out nicely in preparation for using it in
kvm-intel.ko and kvm-amd.ko, since the function is not currently exported.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Calling into the MIDR checking framework from the PI code has recently
become much harder, due to the new fancy "multi-MIDR" support that
relies on tables being populated at boot time, but not that early that
they are available to the PI code. There are additional issues with
this framework, as the code really isn't position independend *at all*.
This leads to some ugly breakages, as reported by Ada.
It so appears that the only reason for the PI code to call into the
MIDR checking code is to cope with The Most Broken ARM64 System Ever,
aka Cavium ThunderX, which cannot deal with nG attributes that result
of the combination of KASLR and KPTI as a consequence of Erratum 27456.
Duplicate the check for the erratum in the PI code, removing the
dependency on the bulk of the MIDR checking framework. This allows
dropping that same check from kaslr_requires_kpti(), as the KPTI code
already relies on the ARM64_WORKAROUND_CAVIUM_27456 cap.
Fixes: c8c2647e69bed ("arm64: Make _midr_in_range_list() an exported function")
Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/3d97e45a-23cf-419b-9b6f-140b4d88de7b@arm.com
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20250418093129.1755739-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Pull bpf fixes from Alexei Starovoitov:
- Followup fixes for resilient spinlock (Kumar Kartikeya Dwivedi):
- Make res_spin_lock test less verbose, since it was spamming BPF
CI on failure, and make the check for AA deadlock stronger
- Fix rebasing mistake and use architecture provided
res_smp_cond_load_acquire
- Convert BPF maps (queue_stack and ringbuf) to resilient spinlock
to address long standing syzbot reports
- Make sure that classic BPF load instruction from SKF_[NET|LL]_OFF
offsets works when skb is fragmeneted (Willem de Bruijn)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Convert ringbuf map to rqspinlock
bpf: Convert queue_stack map to rqspinlock
bpf: Use architecture provided res_smp_cond_load_acquire
selftests/bpf: Make res_spin_lock AA test condition stronger
selftests/net: test sk_filter support for SKF_NET_OFF on frags
bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags
selftests/bpf: Make res_spin_lock test less verbose
|
|
The pmcr_n field obviously refers to PMCR_EL0.N, but is generally used
as the number of counters seen by the guest. Rename it accordingly.
Suggested-by: Oliver upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In v2 of rqspinlock [0], we fixed potential problems with WFE usage in
arm64 to fallback to a version copied from Ankur's series [1]. This
logic was moved into arch-specific headers in v3 [2].
However, we missed using the arch-provided res_smp_cond_load_acquire
in commit ebababcd0372 ("rqspinlock: Hardcode cond_acquire loops for arm64")
due to a rebasing mistake between v2 and v3 of the rqspinlock series.
Fix the typo to fallback to the arm64 definition as we did in v2.
[0]: https://lore.kernel.org/bpf/20250206105435.2159977-18-memxor@gmail.com
[1]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@oracle.com
[2]: https://lore.kernel.org/bpf/20250303152305.3195648-9-memxor@gmail.com
Fixes: ebababcd0372 ("rqspinlock: Hardcode cond_acquire loops for arm64")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250410145512.1876745-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
There are several issues with the way the native signal handling code
manipulates FPSIMD/SVE/SME state, described in detail below. These
issues largely result from races with preemption and inconsistent
handling of live state vs saved state.
Known issues with native FPSIMD/SVE/SME state management include:
* On systems with FPMR, the code to save/restore the FPMR accesses the
register while it is not owned by the current task. Consequently, this
may corrupt the FPMR of the current task and/or may corrupt the FPMR
of an unrelated task. The FPMR save/restore has been broken since it
was introduced in commit:
8c46def44409fc91 ("arm64/signal: Add FPMR signal handling")
* On systems with SME, setup_return() modifies both the live register
state and the saved state register state regardless of whether the
task's state is live, and without holding the cpu fpsimd context.
Consequently:
- This may corrupt the state an unrelated task which has PSTATE.SM set
and/or PSTATE.ZA set.
- The task may enter the signal handler in streaming mode, and or with
ZA storage enabled unexpectedly.
- The task may enter the signal handler in non-streaming SVE mode with
stale SVE register state, which may have been inherited from
streaming SVE mode unexpectedly. Where the streaming and
non-streaming vector lengths differ, this may be packed into
registers arbitrarily.
This logic has been broken since it was introduced in commit:
40a8e87bb32855b3 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Further incorrect manipulation of state was added in commits:
ea64baacbc36a0d5 ("arm64/signal: Flush FPSIMD register state when disabling streaming mode")
baa8515281b30861 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
* Several restoration functions use fpsimd_flush_task_state() to discard
the live FPSIMD/SVE/SME while the in-memory copy is stale.
When a subset of the FPSIMD/SVE/SME state is restored, the remainder
may be non-deterministically reset to a stale snapshot from some
arbitrary point in the past.
This non-deterministic discarding was introduced in commit:
8cd969d28fd2848d ("arm64/sve: Signal handling support")
As of that commit, when TIF_SVE was initially clear, failure to
restore the SVE signal frame could reset the FPSIMD registers to a
stale snapshot.
The pattern of discarding unsaved state was subsequently copied into
restoration functions for some new state in commits:
39782210eb7e8763 ("arm64/sme: Implement ZA signal handling")
ee072cf708048c0d ("arm64/sme: Implement signal handling for ZT")
* On systems with SME/SME2, the entire FPSIMD/SVE/SME state may be
loaded onto the CPU redundantly. Either restore_fpsimd_context() or
restore_sve_fpsimd_context() will load the entire FPSIMD/SVE/SME state
via fpsimd_update_current_state() before restore_za_context() and
restore_zt_context() each discard the state via
fpsimd_flush_task_state().
This is purely redundant work, and not a functional bug.
To fix these issues, rework the native signal handling code to always
save+flush the current task's FPSIMD/SVE/SME state before manipulating
that state. This avoids races with preemption and ensures that state is
manipulated consistently regardless of whether it happened to be live
prior to manipulation. This largely involes:
* Using fpsimd_save_and_flush_current_state() to save+flush the state
for both signal delivery and signal return, before the state is
manipulated in any way.
* Removing fpsimd_signal_preserve_current_state() and updating
preserve_fpsimd_context() to explicitly ensure that the FPSIMD state
is up-to-date, as preserve_fpsimd_context() is the only consumer of
the FPSIMD state during signal delivery.
* Modifying fpsimd_update_current_state() to not reload the FPSIMD state
onto the CPU. Ideally we'd remove fpsimd_update_current_state()
entirely, but I've left that for subsequent patches as there are a
number of of other problems with the FPSIMD<->SVE conversion helpers
that should be addressed at the same time. For now I've removed the
misleading comment.
For setup_return(), we need to decide (for ABI reasons) whether signal
delivery should have all the side-effects of an SMSTOP. For now I've
left a TODO comment, as there are other questions in this area that I'll
address with subsequent patches.
Fixes: 8c46def44409 ("arm64/signal: Add FPMR signal handling")
Fixes: 40a8e87bb328 ("arm64/sme: Disable ZA and streaming mode when handling signals")
Fixes: ea64baacbc36 ("arm64/signal: Flush FPSIMD register state when disabling streaming mode")
Fixes: baa8515281b3 ("arm64/fpsimd: Track the saved FPSIMD state type separately to TIF_SVE")
Fixes: 8cd969d28fd2 ("arm64/sve: Signal handling support")
Fixes: 39782210eb7e ("arm64/sme: Implement ZA signal handling")
Fixes: ee072cf70804 ("arm64/sme: Implement signal handling for ZT")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-13-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
When the current task's FPSIMD/SVE/SME state may be live on *any* CPU in
the system, special care must be taken when manipulating that state, as
this manipulation can race with preemption and/or asynchronous usage of
FPSIMD/SVE/SME (e.g. kernel-mode NEON in softirq handlers).
Even when manipulation is is protected with get_cpu_fpsimd_context() and
get_cpu_fpsimd_context(), the logic necessary when the state is live on
the current CPU can be wildly different from the logic necessary when
the state is not live on the current CPU. A number of historical and
extant issues result from failing to handle these cases consistetntly
and/or correctly.
To make it easier to get such manipulation correct, add a new
fpsimd_save_and_flush_current_state() helper function, which ensures
that the current task's state has been saved to memory and any stale
state on any CPU has been "flushed" such that is not live on any CPU in
the system. This will allow code to safely manipulate the saved state
without risk of races.
Subsequent patches will use the new function.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-11-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
There have been no users of fpsimd_force_sync_to_sve() since commit:
bbc6172eefdb276b ("arm64/fpsimd: SME no longer requires SVE register state")
Remove fpsimd_force_sync_to_sve().
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The SME trap handler consumes RES0 bits from the ESR when determining
the reason for the trap, and depends upon those bits reading as zero.
This may break in future when those RES0 bits are allocated a meaning
and stop reading as zero.
For SME traps taken with ESR_ELx.EC == 0b011101, the specific reason for
the trap is indicated by ESR_ELx.ISS.SMTC ("SME Trap Code"). This field
occupies bits [2:0] of ESR_ELx.ISS, and as of ARM DDI 0487 L.a, bits
[24:3] of ESR_ELx.ISS are RES0. ESR_ELx.ISS itself occupies bits [24:0]
of ESR_ELx.
Extract the SMTC field specifically, matching the way we handle ESR_ELx
fields elsewhere, and ensuring that the handler is future-proof.
Fixes: 8bd7f91c03d8 ("arm64/sme: Implement traps and syscall handling for SME")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20250409164010.3480271-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA
- Use acquire/release semantics in the KVM FF-A proxy to avoid
reading a stale value for the FF-A version
- Fix KVM guest driver to match PV CPUID hypercall ABI
- Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work
s390:
- Don't use %pK for debug printing and tracepoints
x86:
- Use a separate subclass when acquiring KVM's per-CPU posted
interrupts wakeup lock in the scheduled out path, i.e. when adding
a vCPU on the list of vCPUs to wake, to workaround a false positive
deadlock. The schedule out code runs with a scheduler lock that the
wakeup handler takes in the opposite order; but it does so with
IRQs disabled and cannot run concurrently with a wakeup
- Explicitly zero-initialize on-stack CPUID unions
- Allow building irqbypass.ko as as module when kvm.ko is a module
- Wrap relatively expensive sanity check with KVM_PROVE_MMU
- Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
selftests:
- Add more scenarios to the MONITOR/MWAIT test
- Add option to rseq test to override /dev/cpu_dma_latency
- Bring list of exit reasons up to date
- Cleanup Makefile to list once tests that are valid on all
architectures
Other:
- Documentation fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits)
KVM: arm64: Use acquire/release to communicate FF-A version negotiation
KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable
KVM: arm64: selftests: Introduce and use hardware-definition macros
KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive
KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list
KVM: x86: Explicitly zero-initialize on-stack CPUID unions
KVM: Allow building irqbypass.ko as as module when kvm.ko is a module
KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU
KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency
KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses
Documentation: kvm: remove KVM_CAP_MIPS_TE
Documentation: kvm: organize capabilities in the right section
Documentation: kvm: fix some definition lists
Documentation: kvm: drop "Capability" heading from capabilities
Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE
Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE
selftests: kvm: list once tests that are valid on all architectures
selftests: kvm: bring list of exit reasons up to date
selftests: kvm: revamp MONITOR/MWAIT tests
KVM: arm64: Don't translate FAR if invalid/unsafe
...
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64: First batch of fixes for 6.15
- Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk
stage-1 page tables) to align with the architecture. This avoids
possibly taking an SEA at EL2 on the page table walk or using an
architecturally UNKNOWN fault IPA.
- Use acquire/release semantics in the KVM FF-A proxy to avoid reading
a stale value for the FF-A version.
- Fix KVM guest driver to match PV CPUID hypercall ABI.
- Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM
selftests, which is the only memory type for which atomic
instructions are architecturally guaranteed to work.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix max_pfn calculation when hotplugging memory so that it never
decreases
- Fix dereference of unused source register in the MOPS SET operation
fault handling
- Fix NULL calling in do_compat_alignment_fixup() when the 32-bit user
space does an unaligned LDREX/STREX
- Add the HiSilicon HIP09 processor to the Spectre-BHB affected CPUs
- Drop unused code pud accessors (special/mkspecial)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Don't call NULL in do_compat_alignment_fixup()
arm64: Add support for HIP09 Spectre-BHB mitigation
arm64: mm: Drop dead code for pud special bit handling
arm64: mops: Do not dereference src reg for a set operation
arm64: mm: Correct the update of max_pfn
|
|
Don't re-walk the page tables if an SEA occurred during the faulting
page table walk to avoid taking a fatal exception in the hyp.
Additionally, check that FAR_EL2 is valid for SEAs not taken on PTW
as the architecture doesn't guarantee it contains the fault VA.
Finally, fix up the rest of the abort path by checking for SEAs early
and bugging the VM if we get further along with an UNKNOWN fault IPA.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-4-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Switch over to the typical sysreg table for HPFAR_EL2 as we're about to
start using more fields in the register.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-3-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
KVM's logic for deciding when HPFAR_EL2 is UNKNOWN doesn't align with
the architecture. Most notably, KVM assumes HPFAR_EL2 contains the
faulting IPA even in the case of an SEA.
Align the logic with the architecture rather than attempting to
paraphrase it. Additionally, take the opportunity to improve the
language around ARM erratum #834220 such that it actually describes the
bug.
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20250402201725.2963645-2-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.
This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.
- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively monir cleanups for the memcontrol code.
- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found then using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes thins better - our own HMM selftests now
succeed.
- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
remove the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.
- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.
- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.
- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.
- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.
- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.
- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.
- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.
- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.
- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.
- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.
- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.
- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.
- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.
- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DMAON/DAMOS to filter my the
page's mapped/unmapped status.
- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.
- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while runnimg our selftests.
- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.
- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.
- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.
- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatoty cleanups to the GENERIC_PTDUMP
Kconfig logic.
- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval tuning.
- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.
- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.
- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.
- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallalization of huge page
initialization.
- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.
- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documention
is updated accordingly.
- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.
- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.
- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permittig the removal of a number of special-case
checks.
- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatoty refactoring and cleanup of the mremap() code.
- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.
- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.
- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into code generic code, where that is practical.
- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.
- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.
- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.
- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.
- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.
- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.
- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.
- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.
- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.
- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.
- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.
- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.
- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.
- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.
* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf relisient spinlock support from Alexei Starovoitov:
"This patch set introduces Resilient Queued Spin Lock (or rqspinlock
with res_spin_lock() and res_spin_unlock() APIs).
This is a qspinlock variant which recovers the kernel from a stalled
state when the lock acquisition path cannot make forward progress.
This can occur when a lock acquisition attempt enters a deadlock
situation (e.g. AA, or ABBA), or more generally, when the owner of the
lock (which we’re trying to acquire) isn’t making forward progress.
Deadlock detection is the main mechanism used to provide instant
recovery, with the timeout mechanism acting as a final line of
defense. Detection is triggered immediately when beginning the waiting
loop of a lock slow path.
Additionally, BPF programs attached to different parts of the kernel
can introduce new control flow into the kernel, which increases the
likelihood of deadlocks in code not written to handle reentrancy.
There have been multiple syzbot reports surfacing deadlocks in
internal kernel code due to the diverse ways in which BPF programs can
be attached to different parts of the kernel. By switching the BPF
subsystem’s lock usage to rqspinlock, all of these issues are
mitigated at runtime.
This spin lock implementation allows BPF maps to become safer and
remove mechanisms that have fallen short in assuring safety when
nesting programs in arbitrary ways in the same context or across
different contexts.
We run benchmarks that stress locking scalability and perform
comparison against the baseline (qspinlock). For the rqspinlock case,
we replace the default qspinlock with it in the kernel, such that all
spin locks in the kernel use the rqspinlock slow path. As such,
benchmarks that stress kernel spin locks end up exercising rqspinlock.
More details in the cover letter in commit 6ffb9017e932 ("Merge branch
'resilient-queued-spin-lock'")"
* tag 'bpf_res_spin_lock' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (24 commits)
selftests/bpf: Add tests for rqspinlock
bpf: Maintain FIFO property for rqspinlock unlock
bpf: Implement verifier support for rqspinlock
bpf: Introduce rqspinlock kfuncs
bpf: Convert lpm_trie.c to rqspinlock
bpf: Convert percpu_freelist.c to rqspinlock
bpf: Convert hashtab.c to rqspinlock
rqspinlock: Add locktorture support
rqspinlock: Add entry to Makefile, MAINTAINERS
rqspinlock: Add macros for rqspinlock usage
rqspinlock: Add basic support for CONFIG_PARAVIRT
rqspinlock: Add a test-and-set fallback
rqspinlock: Add deadlock detection and recovery
rqspinlock: Protect waiters in trylock fallback from stalls
rqspinlock: Protect waiters in queue from stalls
rqspinlock: Protect pending bit owners from stalls
rqspinlock: Hardcode cond_acquire loops for arm64
rqspinlock: Add support for timeouts
rqspinlock: Drop PV and virtualization support
rqspinlock: Add rqspinlock.h header
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:
"For this merge window we're splitting BPF pull request into three for
higher visibility: main changes, res_spin_lock, try_alloc_pages.
These are the main BPF changes:
- Add DFA-based live registers analysis to improve verification of
programs with loops (Eduard Zingerman)
- Introduce load_acquire and store_release BPF instructions and add
x86, arm64 JIT support (Peilin Ye)
- Fix loop detection logic in the verifier (Eduard Zingerman)
- Drop unnecesary lock in bpf_map_inc_not_zero() (Eric Dumazet)
- Add kfunc for populating cpumask bits (Emil Tsalapatis)
- Convert various shell based tests to selftests/bpf/test_progs
format (Bastien Curutchet)
- Allow passing referenced kptrs into struct_ops callbacks (Amery
Hung)
- Add a flag to LSM bpf hook to facilitate bpf program signing
(Blaise Boscaccy)
- Track arena arguments in kfuncs (Ihor Solodrai)
- Add copy_remote_vm_str() helper for reading strings from remote VM
and bpf_copy_from_user_task_str() kfunc (Jordan Rome)
- Add support for timed may_goto instruction (Kumar Kartikeya
Dwivedi)
- Allow bpf_get_netns_cookie() int cgroup_skb programs (Mahe Tardy)
- Reduce bpf_cgrp_storage_busy false positives when accessing cgroup
local storage (Martin KaFai Lau)
- Introduce bpf_dynptr_copy() kfunc (Mykyta Yatsenko)
- Allow retrieving BTF data with BTF token (Mykyta Yatsenko)
- Add BPF kfuncs to set and get xattrs with 'security.bpf.' prefix
(Song Liu)
- Reject attaching programs to noreturn functions (Yafang Shao)
- Introduce pre-order traversal of cgroup bpf programs (Yonghong
Song)"
* tag 'bpf-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (186 commits)
selftests/bpf: Add selftests for load-acquire/store-release when register number is invalid
bpf: Fix out-of-bounds read in check_atomic_load/store()
libbpf: Add namespace for errstr making it libbpf_errstr
bpf: Add struct_ops context information to struct bpf_prog_aux
selftests/bpf: Sanitize pointer prior fclose()
selftests/bpf: Migrate test_xdp_vlan.sh into test_progs
selftests/bpf: test_xdp_vlan: Rename BPF sections
bpf: clarify a misleading verifier error message
selftests/bpf: Add selftest for attaching fexit to __noreturn functions
bpf: Reject attaching fexit/fmod_ret to __noreturn functions
bpf: Only fails the busy counter check in bpf_cgrp_storage_get if it creates storage
bpf: Make perf_event_read_output accessible in all program types.
bpftool: Using the right format specifiers
bpftool: Add -Wformat-signedness flag to detect format errors
selftests/bpf: Test freplace from user namespace
libbpf: Pass BPF token from find_prog_btf_id to BPF_BTF_GET_FD_BY_ID
bpf: Return prog btf_id without capable check
bpf: BPF token support for BPF_BTF_GET_FD_BY_ID
bpf, x86: Fix objtool warning for timed may_goto
bpf: Check map->record at the beginning of check_and_free_fields()
...
|
|
The HIP09 processor is vulnerable to the Spectre-BHB (Branch History
Buffer) attack, which can be exploited to leak information through
branch prediction side channels. This commit adds the MIDR of HIP09
to the list for software mitigation.
Signed-off-by: Jinqian Yang <yangjinqian1@huawei.com>
Link: https://lore.kernel.org/r/20250325141900.2057314-1-yangjinqian1@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Keith Busch observed some incorrect macros defined in arm64 code [1].
It turns out the two lines should never be needed and won't be exposed to
anyone, because aarch64 doesn't select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD,
hence ARCH_SUPPORTS_PUD_PFNMAP is always N. The only archs that support
THP PUDs so far are x86 and powerpc.
Instead of fixing the lines (with no way to test it..), remove the two
lines that are in reality dead code, to avoid confusing readers.
Fixes tag is attached to reflect where the wrong macros were introduced,
but explicitly not copying stable, because there's no real issue to be
fixed. So it's only about removing the dead code so far.
[1] https://lore.kernel.org/all/Z9tDjOk-JdV_fCY4@kbusch-mbp.dhcp.thefacebook.com/#t
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Donald Dutile <ddutile@redhat.com>
Cc: Will Deacon <will@kernel.org>
Fixes: 3e509c9b03f9 ("mm/arm64: support large pfn mappings")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20250320183405.12659-1-peterx@redhat.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The source register is not used for SET* and reading it can result in
a UBSAN out-of-bounds array access error, specifically when the MOPS
exception is taken from a SET* sequence with XZR (reg 31) as the
source. Architecturally this is the only case where a src/dst/size
field in the ESR can be reported as 31.
Prior to 2de451a329cf662b the code in do_el0_mops() was benign as the
use of pt_regs_read_reg() prevented the out-of-bounds access.
Fixes: 2de451a329cf ("KVM: arm64: Add handler for MOPS exceptions")
Cc: <stable@vger.kernel.org> # 6.12.x
Cc: Kristina Martsenko <kristina.martsenko@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Keir Fraser <keirf@google.com>
Reviewed-by: Kristina Martšenko <kristina.martsenko@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20250326110448.3792396-1-keirf@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv updates from Wei Liu:
- Add support for running as the root partition in Hyper-V (Microsoft
Hypervisor) by exposing /dev/mshv (Nuno and various people)
- Add support for CPU offlining in Hyper-V (Hamza Mahfooz)
- Misc fixes and cleanups (Roman Kisel, Tianyu Lan, Wei Liu, Michael
Kelley, Thorsten Blum)
* tag 'hyperv-next-signed-20250324' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
x86/hyperv: fix an indentation issue in mshyperv.h
x86/hyperv: Add comments about hv_vpset and var size hypercall input args
Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs
hyperv: Add definitions for root partition driver to hv headers
x86: hyperv: Add mshv_handler() irq handler and setup function
Drivers: hv: Introduce per-cpu event ring tail
Drivers: hv: Export some functions for use by root partition module
acpi: numa: Export node_to_pxm()
hyperv: Introduce hv_recommend_using_aeoi()
arm64/hyperv: Add some missing functions to arm64
x86/mshyperv: Add support for extended Hyper-V features
hyperv: Log hypercall status codes as strings
x86/hyperv: Fix check of return value from snp_set_vmsa()
x86/hyperv: Add VTL mode callback for restarting the system
x86/hyperv: Add VTL mode emergency restart callback
hyperv: Remove unused union and structs
hyperv: Add CONFIG_MSHV_ROOT to gate root partition support
hyperv: Change hv_root_partition into a function
hyperv: Convert hypercall statuses to linux error codes
drivers/hv: add CPU offlining support
...
|