Age | Commit message | Author
|
Reservation of the PMU hardware is done at first event creation
and is protected by a pair of mutex_lock() and mutex_unlock().
After reservation of the PMU hardware, the memory required for the
PMUs the event is to be installed on is allocated by
allocate_buffers() and alloc_sampling_buffer().
This allocation is done outside of the mutex protection, so two or more
concurrent invocations of perf_event_init() may run in parallel.
This can lead to Sample Data Blocks (SDBs) being allocated multiple
times for the same PMU.
Prevent this by also protecting the memory allocation of the SDBs with
the mutex.
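As a rough sketch of the resulting ordering (allocate_buffers() is named
above; the mutex and the reservation helper carry illustrative names here),
both the reservation and the SDB allocation now happen under the same lock:

    #include <linux/mutex.h>
    #include <linux/perf_event.h>

    static DEFINE_MUTEX(pmc_reserve_mutex);	/* illustrative name */

    static int reserve_pmc_and_buffers(struct perf_event *event)
    {
    	int rc;

    	mutex_lock(&pmc_reserve_mutex);
    	rc = reserve_pmc_hardware();		/* first event reserves the PMU */
    	if (!rc)
    		rc = allocate_buffers(event);	/* SDB allocation, now under the mutex */
    	mutex_unlock(&pmc_reserve_mutex);

    	return rc;
    }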
Fixes: 8a6fe8f21ec4 ("s390/cpum_sf: Use refcount_t instead of atomic_t")
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Reviewed-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
|
|
The ACPI Boot Error Record Table (BERT) is used by the kernel to report
errors that occurred during a previous boot. On some modern AMD systems,
the errors within the BERT are reported in the x86 Common Platform Error
Record (CPER) format, which consists of one or more Processor Context
Information Structures.
These context structures provide a starting address and represent an x86 MSR
range in which the data constitutes a contiguous set of MSRs starting from,
and including the starting address.
It's common, for AMD systems that implement this behavior, that the MSR range
represents the MCAX register space used for the Scalable MCA feature. The
apei_smca_report_x86_error() function decodes and passes this information
through the MCE notifier chain. However, this function assumes a fixed
register size based on the original HW/FW implementation.
This assumption breaks with the addition of two new MCAX registers viz.
MCA_SYND1 and MCA_SYND2. These registers are added at the end of the MCAX
register space, so they won't be included when decoding the CPER data.
Rework apei_smca_report_x86_error() to support a variable register array size.
This covers any case where the MSR context information starts at the MCAX
address for MCA_STATUS and ends at any other register within the MCAX register
space.
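As a loose illustration of the approach (the helper name below is an
approximation, not the actual patch), the register count is derived from
the size of the context entry rather than hard-coded:

    /* Sketch: size the MSR array from the CPER context length. */
    static size_t smca_ctx_num_regs(size_t reg_arr_size_bytes)
    {
    	return reg_arr_size_bytes / sizeof(u64);
    }

With this, a context that carries MCA_SYND1 and MCA_SYND2 at the end of
the MCAX space simply yields a larger count, and a shorter context is not
over-read.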
[ Yazen: Add Avadhut as co-developer for wrapper changes. ]
[ bp: Massage. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Co-developed-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20241022194158.110073-5-avadhut.naik@amd.com
|
|
The clockevents core already detached and unregistered it at this stage.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241029125451.54574-5-frederic@kernel.org
|
|
Starting with Zen4, AMD's Scalable MCA systems incorporate two new registers:
MCA_SYND1 and MCA_SYND2.
These registers will include supplemental error information in addition to the
existing MCA_SYND register. The data within these registers is considered
valid if MCA_STATUS[SyndV] is set.
Userspace error decoding tools like rasdaemon gather related hardware error
information through the tracepoints.
Therefore, export these two registers through the mce_record tracepoint so
that tools like rasdaemon can parse them and output the supplemental error
information like FRU text contained in them.
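Purely as an illustration of the mechanism (this is not the real
mce_record definition, and the value sources are hypothetical), exporting
two extra u64 values through a tracepoint follows the usual TRACE_EVENT()
pattern:

    #undef TRACE_SYSTEM
    #define TRACE_SYSTEM mce_demo

    #if !defined(_TRACE_MCE_DEMO_H) || defined(TRACE_HEADER_MULTI_READ)
    #define _TRACE_MCE_DEMO_H

    #include <linux/tracepoint.h>

    TRACE_EVENT(mce_demo_record,
    	TP_PROTO(u64 synd1, u64 synd2),
    	TP_ARGS(synd1, synd2),
    	TP_STRUCT__entry(
    		__field(u64, synd1)
    		__field(u64, synd2)
    	),
    	TP_fast_assign(
    		__entry->synd1 = synd1;
    		__entry->synd2 = synd2;
    	),
    	TP_printk("SYND1 0x%llx SYND2 0x%llx",
    		  __entry->synd1, __entry->synd2)
    );

    #endif /* _TRACE_MCE_DEMO_H */
    #include <trace/define_trace.h>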
[ bp: Massage. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20241022194158.110073-4-avadhut.naik@amd.com
|
|
Until now, we didn't really care about WXN as it didn't have an
effect on the R/W permissions (only the execution permission could be
dropped), and was therefore not of interest to AT.
However, with S1POE, WXN can revoke the Write permission if an
overlay is active and execution is allowed. This *is* relevant
to AT.
Add full handling of WXN so that we correctly handle this case.
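In effect, the rule being handled amounts to the following (the flag
names and helper are illustrative, not the actual AT walker code):

    #define PERM_W	0x2	/* illustrative permission encodings */
    #define PERM_X	0x4

    /* With an overlay active, WXN removes write permission from any
     * mapping that remains executable. */
    static unsigned int apply_wxn(unsigned int perm, bool wxn)
    {
    	if (wxn && (perm & PERM_X))
    		perm &= ~PERM_W;

    	return perm;
    }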
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-38-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
We now have the infrastructure in place to emulate S1POE:
- direct permissions are always overlay-capable
- indirect permissions are overlay-capable if the permissions are
in the 0b0xxx range
- the overlays are strictly subtractive
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-37-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Move the conditions describing PAN into the s1_walk_info
structure, in an effort to declutter the permission processing.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-36-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The hierarchical permissions must be disabled when POE is enabled
in the translation regime used for a given table walk.
We store the two enable bits in the s1_walk_info structure so that
they can be retrieved down the line, as they will be useful.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-35-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Just like the other extensions affecting address translation,
we must save/restore POE so that an out-of-context translation
context can be restored and used with the AT instructions.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-34-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
POR_EL2 needs saving when the guest is VHE, and restoring in
any case.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-33-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
S1POE support implies support for POR_EL2, which we provide by
- adding it to the vcpu_sysreg enum
- advertising it as mapped to its EL1 counterpart in get_el2_to_el1_mapping
- wiring it in the sys_reg_desc table with the correct visibility
- handling POR_EL1 in __vcpu_{read,write}_sys_reg_from_cpu()
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-32-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Just like we have kvm_has_s1pie(), add its S1POE counterpart,
making the code slightly more readable.
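Assuming it mirrors the existing kvm_has_s1pie() helper (the exact
definition is not quoted here), the new predicate is just a
feature-register check:

    /* Sketch, modelled on kvm_has_s1pie(): */
    #define kvm_has_s1poe(k)	\
    	(kvm_has_feat((k), ID_AA64MMFR3_EL1, S1POE, IMP))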
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-31-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
All the EL0/EL1 S1PIE/S1POE system registers are caught by the HCR_EL2
TVM and TRVM bits. Reflect this in the coarse grained trap table.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-30-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
It took me some time to realise it, but CPTR_EL2.E0POE does not
apply to a guest, only to EL0 when InHost(). And when InHost(),
CPTR_EL2 is mapped to CPACR_EL1, meaning that the E0POE bit naturally
takes effect without any trap.
To sum it up, this trap bit is better left ignored; we will never
have to handle it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-29-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
POR_EL2 is the equivalent of POR_EL1 for the EL2&0 translation
regime, and it is sorely missing from the sysreg file.
Add the sucker.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-28-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
With a visibility defined for these registers, there is no need
to check again for S1PIE or TCRX being implemented as perform_access()
already handles it.
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-27-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
When the guest does not support S1PIE we should not allow any access
to the system registers it adds in order to ensure that we do not create
spurious issues with guest migration. Add a visibility operation for these
registers.
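The visibility hook for such registers typically has the following shape
(the function name here is illustrative; REG_HIDDEN is the existing
mechanism used for other hidden registers):

    static unsigned int s1pie_visibility(const struct kvm_vcpu *vcpu,
    				     const struct sys_reg_desc *rd)
    {
    	if (kvm_has_s1pie(vcpu->kvm))
    		return 0;

    	return REG_HIDDEN;
    }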
Fixes: 86f9de9db178 ("KVM: arm64: Save/restore PIE registers")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-3-376624fa829c@kernel.org
[maz: simplify by using __el2_visibility(), kvm_has_s1pie() throughout]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-26-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
When the guest does not support FEAT_TCR2 we should not allow any access
to it in order to ensure that we do not create spurious issues with guest
migration. Add a visibility operation for it.
Fixes: fbff56068232 ("KVM: arm64: Save/restore TCR2_EL1")
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-2-376624fa829c@kernel.org
[maz: simplify by using __el2_visibility(), kvm_has_tcr2() throughout]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-25-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
In preparation for adding more visibility filtering for EL2 registers add
a helper macro like EL2_REG() which allows specification of a custom
visibility operation.
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20240822-kvm-arm64-hide-pie-regs-v2-1-376624fa829c@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-24-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
We are starting to have a bunch of visibility helpers checking
for EL2 + something else, and we are going to add more.
Simplify things somewhat by introducing a helper that implements
exactly that by taking a visibility helper as a parameter,
and convert the existing ones to it.
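Roughly (the exact prototype may differ), the combined helper defers to
the EL2 check first and then to the wrapped visibility callback:

    static unsigned int __el2_visibility(const struct kvm_vcpu *vcpu,
    				     const struct sys_reg_desc *rd,
    				     unsigned int (*fn)(const struct kvm_vcpu *,
    							const struct sys_reg_desc *))
    {
    	return el2_visibility(vcpu, rd) ?: fn(vcpu, rd);
    }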
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-23-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
It doesn't take much effort to implement S1PIE support in AT.
It is only a matter of using the AArch64.S1IndirectBasePermissions()
encodings for the permissions, ignoring GCS which has no impact on AT,
and enforcing FEAT_PAN3 being enabled, as this is a requirement of
FEAT_S1PIE.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-22-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
S1PIE implicitly disables hierarchical permissions, as specified in
R_JHSVW, by making TCR_ELx.HPDn RES1.
Add a predicate for S1PIE being enabled for a given translation regime,
and emulate this behaviour by forcing the hpd field to true if S1PIE
is enabled for that translation regime.
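The emulation side then boils down to something like this (the wrapper
name is illustrative; s1pie_enabled() and the hpd field are the ones
described above):

    static void adjust_hpd_for_pie(struct kvm_vcpu *vcpu, struct s1_walk_info *wi)
    {
    	/* S1PIE forces TCR_ELx.HPDn to RES1: hierarchical perms are off */
    	if (s1pie_enabled(vcpu, wi->regime))
    		wi->hpd = true;
    }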
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-21-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The AArch64.S1DirectBasePermissions() pseudocode deals with both
direct and hierarchical S1 permission evaluation. While this is
probably convenient in the pseudocode, we would like a bit more
flexibility to slot in things like indirect permissions.
To that effect, split the two permission check parts out of
handle_at_slow() and into their own functions. The permissions
are passed around as part of the walk_result structure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-20-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Emulating AT using AT instructions requires that the live state
matches the translation regime the AT instruction targets.
If targeting the EL1&0 translation regime and S1PIE is
supported, we also need to restore that state (covering TCR2_EL1,
PIR_EL1, and PIRE0_EL1).
Add the required system register switcheroo.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-19-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add the FEAT_S1PIE EL2 registers to the sysreg descriptor array so that
they can be handled as a trap.
Access to these registers is conditional based on ID_AA64MMFR3_EL1.S1PIE
being advertised.
Similarly to the other changes, PIRE0_EL2 is guaranteed to trap
thanks to the D22677 update to the architecture.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-17-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Like their EL1 equivalent, the EL2-specific FEAT_S1PIE registers
are context-switched. This is made conditional on both FEAT_TCRX
and FEAT_S1PIE being advertised.
Note that this change only makes sense if read together with the
issue D22677 contained in 102105_K.a_04_en.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-16-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add the FEAT_S1PIE EL2 registers to the per-vcpu sysreg register
array.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-15-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
We currently only use the masking (RES0/RES1) facility for VNCR
registers, as they are memory-based and thus easy to sanitise.
But we could apply the same thing to other registers if we:
- split the sanitisation from __VNCR_START__
- apply the sanitisation when reading from a HW register
This involves a new "marker" in the vcpu_sysreg enum, which
defines the point at which the sanitisation applies (the VNCR
registers being of course after this marker).
While we are at it, rename kvm_vcpu_sanitise_vncr_reg() to
kvm_vcpu_apply_reg_masks(), which is vaguely more explicit,
and harden set_sysreg_masks() against setting masks for
random registers...
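Conceptually, applying a RES0/RES1 mask pair to a value read back from a
register looks like this (the structure layout is illustrative):

    struct sysreg_masks {		/* illustrative layout */
    	u64 res0;		/* bits forced to 0 */
    	u64 res1;		/* bits forced to 1 */
    };

    static u64 apply_reg_masks(const struct sysreg_masks *m, u64 v)
    {
    	v &= ~m->res0;
    	v |= m->res1;

    	return v;
    }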
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-10-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Like its EL1 equivalent, TCR2_EL2 gets context-switched.
This is made conditional on FEAT_TCRX being advertised.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-14-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
For code that accesses any of the guest registers for emulation
purposes, it is crucial to know where the most up-to-date data is.
While this is pretty clear for nVHE (memory is the sole repository),
things are a lot muddier for VHE, as depending on the SYSREGS_ON_CPU
flag, registers can either be loaded on the HW or be in memory.
Even worse with NV, where the loaded state is by definition partial.
For these reasons, KVM offers the vcpu_read_sys_reg() and
vcpu_write_sys_reg() primitives that always do the right thing.
However, these primitives must know which register to access, and
this is the role of the __vcpu_read_sys_reg_from_cpu() and
__vcpu_write_sys_reg_to_cpu() helpers.
As it turns out, TCR2_EL1, PIR_EL1, and PIRE0_EL1 are not described
in the latter helpers, meaning that the AT code cannot use them
to emulate S1PIE.
Add the three registers to the (long) list.
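The additions amount to three more entries in the existing mapping, along
these lines (sketched here as a stand-alone helper; the real code is part
of a much larger switch):

    static bool read_s1pie_reg_from_cpu(int reg, u64 *val)
    {
    	switch (reg) {
    	case TCR2_EL1:	*val = read_sysreg_s(SYS_TCR2_EL1);	return true;
    	case PIR_EL1:	*val = read_sysreg_s(SYS_PIR_EL1);	return true;
    	case PIRE0_EL1:	*val = read_sysreg_s(SYS_PIRE0_EL1);	return true;
    	}

    	return false;
    }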
Fixes: 86f9de9db178 ("KVM: arm64: Save/restore PIE registers")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Joey Gouly <joey.gouly@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-9-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
TCR2_EL2 is a bag of control bits, all of which are only valid if
certain features are present, and RES0 otherwise.
Describe these constraints and register them with the masking
infrastructure.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://lore.kernel.org/r/20241023145345.1613824-13-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Whenever we need to restore the guest's system registers to the CPU, we
now need to take care of the EL2 system registers as well. Most of them
are accessed via traps only, but some have an immediate effect and also
a guest running in VHE mode would expect them to be accessible via their
EL1 encoding, which we do not trap.
For vEL2 we write the virtual EL2 registers with an identical format directly
into their EL1 counterparts, and translate the few registers that have a
different format so that they have the same effect on execution when running
a non-VHE guest hypervisor.
Based on an initial patch from Andre Przywara, rewritten many times
since.
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-8-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add the TCR2_EL2 register to the per-vcpu sysreg register array,
the sysreg descriptor array, and advertise it as mapped to TCR2_EL1
for NV purposes.
Access to this register is conditional based on ID_AA64MMFR3_EL1.TCRX
being advertised.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-12-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
This definition is missing, and we are going to need it to sanitise
TCR2_ELx.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-11-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Accessing CNTHCTL_EL2 is fraught with danger if running with
HCR_EL2.E2H=1: half of the bits are held in CNTKCTL_EL1, and
thus can be changed behind our back, while the rest lives
in the CNTHCTL_EL2 shadow copy that is memory-based.
Yes, this is a lot of fun!
Make sure that we merge the two on read access, while we can
write to CNTKCTL_EL1 in a more straightforward manner.
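On the read side this becomes a merge of the two locations, roughly as
below (the mask value is a placeholder; the split of bits between the two
registers is as described above):

    /* Placeholder for the bits that live in CNTKCTL_EL1 when E2H=1. */
    #define CNTHCTL_EL1_RESIDENT_BITS	GENMASK(17, 0)

    static u64 read_cnthctl_el2(struct kvm_vcpu *vcpu)
    {
    	u64 val = __vcpu_sys_reg(vcpu, CNTHCTL_EL2) & ~CNTHCTL_EL1_RESIDENT_BITS;

    	/* ...merged with the live bits the guest may have changed */
    	return val | (read_sysreg(cntkctl_el1) & CNTHCTL_EL1_RESIDENT_BITS);
    }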
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-7-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
As KVM has grown a bunch of new system registers for NV, it appears
that we are missing them in the get_el2_to_el1_mapping() list.
Most of them are not crucial as they don't tend to be accessed via
vcpu_read_sys_reg() and vcpu_write_sys_reg().
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-6-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
__kvm_at_s1e2() contains the definition of an s2_mmu for the
current context, but doesn't make any use of it. Drop it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-5-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
PIRE0_EL2 is the equivalent of PIRE0_EL1 for the EL2&0 translation
regime, and it is sorely missing from the sysreg file.
Add the sucker.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-4-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
As of the ARM ARM Known Issues document 102105_K.a_04_en, D22677
fixes a problem with the PIRE0_EL2 register, resulting in its
removal from the VNCR page (it had no purpose being there in the
first place).
Follow the architecture update by removing this offset.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-3-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Despite what the documentation says, TCR2_EL2.{SKL0,SKL1} do not exist,
and the corresponding information is in the respective TTBRx_EL2. This
is a leftover from a development version of the architecture.
This change makes TCR2_EL2 similar to TCR2_EL1 in that respect.
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241023145345.1613824-2-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Add a small PTP driver which allows user space to get
the values of the physical and TOD clocks. This allows
programs like chrony to use STP as a clock source and
steer the kernel clock. The physical clock can be used
as a debugging aid to read the clock without any additional
offsets like STP steering or the LPAR offset.
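As a rough outline (the callback and values below are illustrative, not
the driver's actual contents), such a driver registers a read-only clock
with the PTP core and exposes it as a /dev/ptpN device:

    #include <linux/module.h>
    #include <linux/ptp_clock_kernel.h>

    /* Illustrative gettime callback; a real driver would read the TOD or
     * physical clock here and convert it to a timespec64. */
    static int ptp_sketch_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
    {
    	*ts = ns_to_timespec64(0);
    	return 0;
    }

    static struct ptp_clock_info ptp_sketch_info = {
    	.owner		= THIS_MODULE,
    	.name		= "s390-tod-sketch",
    	.max_adj	= 0,			/* clock is not adjustable */
    	.gettime64	= ptp_sketch_gettime,
    };

    /* ptp_clock_register(&ptp_sketch_info, NULL) makes it available. */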
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://patch.msgid.link/20241023065601.449586-3-svens@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
To allow specifying the clock source in the upcoming PTP driver,
add a clocksource ID to the s390 TOD clock.
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://patch.msgid.link/20241023065601.449586-2-svens@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for bpf_arch_text_poke() and arch_prepare_bpf_trampoline()
for 64-bit powerpc. While the code is generic, BPF trampolines are only
enabled on 64-bit powerpc. 32-bit powerpc will need testing and some
updates.
BPF Trampolines adhere to the existing ftrace ABI utilizing a
two-instruction profiling sequence, as well as the newer ABI utilizing a
three-instruction profiling sequence enabling return with a 'blr'. The
trampoline code itself closely follows the x86 implementation.
The BPF prog JIT is extended to mimic the 64-bit powerpc approach for ftrace,
having a single nop at function entry, followed by the function
profiling sequence out-of-line and a separate long branch stub for calls
to trampolines that are out of range. A dummy_tramp is provided to
simplify synchronization similar to arm64.
When attaching a bpf trampoline to a bpf prog, we can patch up to three
things:
- the nop at bpf prog entry to go to the out-of-line stub
- the instruction in the out-of-line stub to either call the bpf trampoline
directly, or to branch to the long_branch stub.
- the trampoline address before the long_branch stub.
We do not need any synchronization here since we always have a valid
branch target regardless of the order in which the above stores are
seen. dummy_tramp ensures that the long_branch stub goes to a valid
destination on other cpus, even when the branch to the long_branch stub
is seen before the updated trampoline address.
However, when detaching a bpf trampoline from a bpf prog, or if changing
the bpf trampoline address, we need synchronization to ensure that other
cpus can no longer branch into the older trampoline so that it can be
safely freed. bpf_tramp_image_put() uses rcu_tasks to ensure all cpus
make forward progress, but we still need to ensure that other cpus
execute isync (or some CSI) so that they don't go back into the
trampoline again. While here, update the stale comment that describes
the redzone usage in ppc64 BPF JIT.
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241030070850.1361304-18-hbathini@linux.ibm.com
|
|
Add powerpc 32-bit and 64-bit samples for ftrace direct. This serves to
show the sample instruction sequence to be used by ftrace direct calls
to adhere to the ftrace ABI.
On 64-bit powerpc, TOC setup requires some additional work.
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241030070850.1361304-17-hbathini@linux.ibm.com
|
|
Add support for DYNAMIC_FTRACE_WITH_DIRECT_CALLS similar to the arm64
implementation.
ftrace direct calls allow custom trampolines to be called into directly
from function ftrace call sites, bypassing the ftrace trampoline
completely. This functionality is currently utilized by BPF trampolines
to hook into kernel function entries.
Since we have a limited relative branch range, we support ftrace direct
calls through support for DYNAMIC_FTRACE_WITH_CALL_OPS. In this
approach, the ftrace trampoline is not entirely bypassed. Rather, it is
re-purposed into a stub that reads the direct_call field from the associated
ftrace_ops structure and branches into that, if it is not NULL. For
this, it is sufficient if we can ensure that the ftrace trampoline is
reachable from all traceable functions.
When multiple ftrace_ops are associated with a call site, we utilize a
callback to set pt_regs->orig_gpr3, which can then be tested on the
return path from the ftrace trampoline to branch into the direct caller.
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241030070850.1361304-16-hbathini@linux.ibm.com
|
|
Implement support for DYNAMIC_FTRACE_WITH_CALL_OPS similar to the
arm64 implementation.
This works by patching-in a pointer to an associated ftrace_ops
structure before each traceable function. If multiple ftrace_ops are
associated with a call site, then a special ftrace_list_ops is used to
enable iterating over all the registered ftrace_ops. If no ftrace_ops
are associated with a call site, then a special ftrace_nop_ops structure
is used to render the ftrace call as a no-op. The ftrace trampoline can then
read the associated ftrace_ops for a call site by loading from an offset
from the LR, and branch directly to the associated function.
The primary advantage with this approach is that we don't have to
iterate over all the registered ftrace_ops for call sites that have a
single ftrace_ops registered. This is the equivalent of implementing
support for dynamic ftrace trampolines, which set up a special ftrace
trampoline for each registered ftrace_ops and have individual call sites
branch into those directly.
A secondary advantage is that this gives us a way to add support for
direct ftrace callers without having to resort to using stubs. The
address of the direct call trampoline can be loaded from the ftrace_ops
structure.
To support this, we reserve a nop before each function on 32-bit
powerpc. For 64-bit powerpc, two nops are reserved before each
out-of-line stub. During ftrace activation, we update this location with
the associated ftrace_ops pointer. Then, on ftrace entry, we load from
this location and call into ftrace_ops->func().
For 64-bit powerpc, we ensure that the out-of-line stub area is
doubleword aligned so that ftrace_ops address can be updated atomically.
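Conceptually, the lookup the trampoline performs can be pictured like this
(the offset arithmetic is only a sketch; the real code derives it from the
patched location relative to the call site):

    /* Sketch: the doubleword patched in before the function (or before its
     * out-of-line stub) holds the ftrace_ops pointer for that call site. */
    static struct ftrace_ops *ops_for_call_site(unsigned long func_entry)
    {
    	return *(struct ftrace_ops **)(func_entry - sizeof(struct ftrace_ops *));
    }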
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241030070850.1361304-15-hbathini@linux.ibm.com
|
|
We are restricted to a .text size of ~32MB when using the out-of-line
function profile sequence. Allow this to be extended up to the previous
limit of ~64MB by reserving space in the middle of .text.
A new config option CONFIG_PPC_FTRACE_OUT_OF_LINE_NUM_RESERVE is
introduced to specify the number of function stubs that are reserved in
.text. On boot, ftrace utilizes stubs from this area first before using
the stub area at the end of .text.
A ppc64le defconfig has ~44k functions that can be traced. A more
conservative value of 32k functions is chosen as the default value of
PPC_FTRACE_OUT_OF_LINE_NUM_RESERVE so that we do not allot more space
than necessary by default. If building a kernel that only has 32k
trace-able functions, we won't allot any more space at the end of .text
during the pass on vmlinux.o. Otherwise, only the remaining functions
get space for stubs at the end of .text. This default value should help
cover a .text size of ~48MB in total (including space reserved at the
end of .text which can cover up to 32MB), which should be sufficient for
most common builds. For a very small kernel build, this can be set to 0.
Or, this can be bumped up to a larger value to support vmlinux .text
size up to ~64MB.
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241030070850.1361304-14-hbathini@linux.ibm.com
|
|
Function profile sequence on powerpc includes two instructions at the
beginning of each function:
    mflr r0
    bl ftrace_caller
The call to ftrace_caller() gets nop'ed out during kernel boot and is
patched in when ftrace is enabled.
Given the sequence, we cannot return from ftrace_caller with 'blr' as we
need to keep LR and r0 intact. This results in link stack (return
address predictor) imbalance when ftrace is enabled. To address that, we
would like to use a three instruction sequence:
    mflr r0
    bl ftrace_caller
    mtlr r0
Furthermore, to support DYNAMIC_FTRACE_WITH_CALL_OPS, we need to
reserve two instruction slots before the function. This results in a
total of five instruction slots to be reserved for ftrace use on each
function that is traced.
Move the function profile sequence out-of-line to minimize its impact.
To do this, we reserve a single nop at function entry using
-fpatchable-function-entry=1 and add a pass on vmlinux.o to determine
the total number of functions that can be traced. This is then used to
generate a .S file reserving the appropriate amount of space for use as
ftrace stubs, which is built and linked into vmlinux.
On bootup, the stub space is split into separate stubs per function and
populated with the proper instruction sequence. A pointer to the
associated stub is maintained in dyn_arch_ftrace.
For modules, space for ftrace stubs is reserved from the generic module
stub space.
This is restricted to and enabled by default only on 64-bit powerpc,
though there are some changes to accommodate 32-bit powerpc. This is
done so that 32-bit powerpc could choose to opt into this based on
further tests and benchmarks.
As an example, after this patch, kernel functions will have a single nop
at function entry:
    <kernel_clone>:
          addis r2,r12,467
          addi r2,r2,-16028
          nop
          mfocrf r11,8
          ...
When ftrace is enabled, the nop is converted to an unconditional branch
to the stub associated with that function:
    <kernel_clone>:
          addis r2,r12,467
          addi r2,r2,-16028
          b ftrace_ool_stub_text_end+0x11b28
          mfocrf r11,8
          ...
The associated stub:
    <ftrace_ool_stub_text_end+0x11b28>:
          mflr r0
          bl ftrace_caller
          mtlr r0
          b kernel_clone+0xc
          ...
This change showed an improvement of ~10% in null_syscall benchmark on a
Power 10 system with ftrace enabled.
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241030070850.1361304-13-hbathini@linux.ibm.com
|
|
On powerpc, we would like to be able to make a pass on vmlinux.o and
generate a new object file to be linked into vmlinux. Add a generic pass
in Makefile.vmlinux that architectures can use for this purpose.
Architectures need to select CONFIG_ARCH_WANTS_PRE_LINK_VMLINUX and must
provide arch/<arch>/tools/Makefile with .arch.vmlinux.o target, which
will be invoked prior to the final vmlinux link step.
Acked-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241030070850.1361304-12-hbathini@linux.ibm.com
|
|
Function tracer on powerpc can only work with vmlinux having a .text
size of up to ~64MB due to powerpc branch instruction having a limited
relative branch range of 32MB. Today, this is only detected on kernel
boot when ftrace is init'ed. Add a post-link script to check the size of
.text so that we can detect this at build time, and break the build if
necessary.
We add a dependency on !COMPILE_TEST for CONFIG_HAVE_FUNCTION_TRACER so
that allyesconfig and other test builds can continue to work without
enabling ftrace.
Signed-off-by: Naveen N Rao <naveen@kernel.org>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241030070850.1361304-11-hbathini@linux.ibm.com
|