|
v2->v1:
1. Remove the simuation of STP and the related bits.
2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
scenario.
As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
counterintuitive result that nop and push variants are much slower than
ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
which excludes 'nop' and 'stp' from the emulatable instructions list.
This force the kernel returns to userspace and execute them out-of-line,
then trapping back to kernel for running uprobe callback functions. This
leads to a significant performance overhead compared to 'ret' variant,
which is already emulated.
Typicall uprobe is installed on 'nop' for USDT and on function entry
which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
and fp into stack regardless kernel or userspace binary. In order to
improve the performance of handling uprobe for common usecases. This
patch supports the emulation of Arm64 equvialents instructions of 'nop'
and 'push'. The benchmark results below indicates the performance gain
of emulation is obvious.
On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
xol (1 cpus)
------------
uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod)
emulation (1 cpus)
-------------------
uprobe-nop: 1.862 ± 0.002M/s (1.862M/prod)
uprobe-push: 1.743 ± 0.006M/s (1.743M/prod)
uprobe-ret: 1.840 ± 0.001M/s (1.840M/prod)
uretprobe-nop: 0.964 ± 0.004M/s (0.964M/prod)
uretprobe-push: 0.936 ± 0.004M/s (0.936M/prod)
uretprobe-ret: 0.940 ± 0.001M/s (0.940M/prod)
As shown above, the performance gap between 'nop/push' and 'ret'
variants has been significantly reduced. Due to the emulation of 'push'
instruction needs to access userspace memory, it spent more cycles than
the other.
As Mark suggested [1], it is painful to emulate the correct atomicity
and ordering properties of STP, especially when it interacts with MTE,
POE, etc. So this patch just focus on the simuation of 'nop'. The
simluation of STP and related changes will be addressed in a separate
patch.
[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
[1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
CC: Andrii Nakryiko <andrii.nakryiko@gmail.com>
CC: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20240909071114.1150053-1-liaochang1@huawei.com
[catalin.marinas@arm.com: small tweaks following MarkR's comments]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
Kprobes needs simulation of instructions that cannot be stepped
from a different memory location, e.g.: those instructions
that uses PC-relative addressing. In simulation, the behaviour
of the instruction is implemented using a copy of pt_regs.
The following instruction categories are simulated:
- All branching instructions(conditional, register, and immediate)
- Literal access instructions(load-literal, adr/adrp)
Conditional execution is limited to branching instructions in
ARM v8. If conditions at PSTATE do not match the condition fields
of opcode, the instruction is effectively NOP.
Thanks to Will Cohen for assorted suggested changes.
Signed-off-by: Sandeepa Prabhu <sandeepa.s.prabhu@gmail.com>
Signed-off-by: William Cohen <wcohen@redhat.com>
Signed-off-by: David A. Long <dave.long@linaro.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
[catalin.marinas@arm.com: removed linux/module.h include]
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|