From baaf553d3bc330697c68a00f96cf11f4edfeac7e Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Mon, 23 Jan 2023 13:46:03 +0000 Subject: arm64: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on arm64. This allows each ftrace callsite to provide an ftrace_ops to the common ftrace trampoline, allowing each callsite to invoke distinct tracer functions without the need to fall back to list processing or to allocate custom trampolines for each callsite. This significantly speeds up cases where multiple distinct trace functions are used and callsites are mostly traced by a single tracer. The main idea is to place a pointer to the ftrace_ops as a literal at a fixed offset from the function entry point, which can be recovered by the common ftrace trampoline. Using a 64-bit literal avoids branch range limitations, and permits the ops to be swapped atomically without special considerations that apply to code-patching. In future this will also allow for the implementation of DYNAMIC_FTRACE_WITH_DIRECT_CALLS without branch range limitations by using additional fields in struct ftrace_ops. As noted in the core patch adding support for DYNAMIC_FTRACE_WITH_CALL_OPS, this approach allows for directly invoking ftrace_ops::func even for ftrace_ops which are dynamically-allocated (or part of a module), without going via ftrace_ops_list_func. Currently, this approach is not compatible with CLANG_CFI, as the presence/absence of pre-function NOPs changes the offset of the pre-function type hash, and there's no existing mechanism to ensure a consistent offset for instrumented and uninstrumented functions. When CLANG_CFI is enabled, the existing scheme with a global ops->func pointer is used, and there should be no functional change. I am currently working with others to allow the two to work together in future (though this will liekly require updated compiler support). I've benchamrked this with the ftrace_ops sample module [1], which is not currently upstream, but available at: https://lore.kernel.org/lkml/20230103124912.2948963-1-mark.rutland@arm.com git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git ftrace-ops-sample-20230109 Using that module I measured the total time taken for 100,000 calls to a trivial instrumented function, with a number of tracers enabled with relevant filters (which would apply to the instrumented function) and a number of tracers enabled with irrelevant filters (which would not apply to the instrumented function). I tested on an M1 MacBook Pro, running under a HVF-accelerated QEMU VM (i.e. on real hardware). Before this patch: Number of tracers || Total time | Per-call average time (ns) Relevant | Irrelevant || (ns) | Total | Overhead =========+============++=============+==============+============ 0 | 0 || 94,583 | 0.95 | - 0 | 1 || 93,709 | 0.94 | - 0 | 2 || 93,666 | 0.94 | - 0 | 10 || 93,709 | 0.94 | - 0 | 100 || 93,792 | 0.94 | - ---------+------------++-------------+--------------+------------ 1 | 1 || 6,467,833 | 64.68 | 63.73 1 | 2 || 7,509,708 | 75.10 | 74.15 1 | 10 || 23,786,792 | 237.87 | 236.92 1 | 100 || 106,432,500 | 1,064.43 | 1063.38 ---------+------------++-------------+--------------+------------ 1 | 0 || 1,431,875 | 14.32 | 13.37 2 | 0 || 6,456,334 | 64.56 | 63.62 10 | 0 || 22,717,000 | 227.17 | 226.22 100 | 0 || 103,293,667 | 1032.94 | 1031.99 ---------+------------++-------------+--------------+-------------- Note: per-call overhead is estimated relative to the baseline case with 0 relevant tracers and 0 irrelevant tracers. After this patch Number of tracers || Total time | Per-call average time (ns) Relevant | Irrelevant || (ns) | Total | Overhead =========+============++=============+==============+============ 0 | 0 || 94,541 | 0.95 | - 0 | 1 || 93,666 | 0.94 | - 0 | 2 || 93,709 | 0.94 | - 0 | 10 || 93,667 | 0.94 | - 0 | 100 || 93,792 | 0.94 | - ---------+------------++-------------+--------------+------------ 1 | 1 || 281,000 | 2.81 | 1.86 1 | 2 || 281,042 | 2.81 | 1.87 1 | 10 || 280,958 | 2.81 | 1.86 1 | 100 || 281,250 | 2.81 | 1.87 ---------+------------++-------------+--------------+------------ 1 | 0 || 280,959 | 2.81 | 1.86 2 | 0 || 6,502,708 | 65.03 | 64.08 10 | 0 || 18,681,209 | 186.81 | 185.87 100 | 0 || 103,550,458 | 1,035.50 | 1034.56 ---------+------------++-------------+--------------+------------ Note: per-call overhead is estimated relative to the baseline case with 0 relevant tracers and 0 irrelevant tracers. As can be seen from the above: a) Whenever there is a single relevant tracer function associated with a tracee, the overhead of invoking the tracer is constant, and does not scale with the number of tracers which are *not* associated with that tracee. b) The overhead for a single relevant tracer has dropped to ~1/7 of the overhead prior to this series (from 13.37ns to 1.86ns). This is largely due to permitting calls to dynamically-allocated ftrace_ops without going through ftrace_ops_list_func. I've run the ftrace selftests from v6.2-rc3, which reports: | # of passed: 110 | # of failed: 0 | # of unresolved: 3 | # of untested: 0 | # of unsupported: 0 | # of xfailed: 1 | # of undefined(test bug): 0 ... where the unresolved entries were the tests for DIRECT functions (which are not supported), and the checkbashisms selftest (which is irrelevant here): | [8] Test ftrace direct functions against tracers [UNRESOLVED] | [9] Test ftrace direct functions against kprobes [UNRESOLVED] | [62] Meta-selftest: Checkbashisms [UNRESOLVED] ... with all other tests passing (or failing as expected). Signed-off-by: Mark Rutland Cc: Florent Revest Cc: Masami Hiramatsu Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Will Deacon Link: https://lore.kernel.org/r/20230123134603.1064407-9-mark.rutland@arm.com Signed-off-by: Catalin Marinas --- arch/arm64/Makefile | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'arch/arm64/Makefile') diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index d62bd221828f..4c3be442fbb3 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -139,7 +139,10 @@ endif CHECKFLAGS += -D__aarch64__ -ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y) +ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_CALL_OPS),y) + KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY + CC_FLAGS_FTRACE := -fpatchable-function-entry=4,2 +else ifeq ($(CONFIG_DYNAMIC_FTRACE_WITH_ARGS),y) KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY CC_FLAGS_FTRACE := -fpatchable-function-entry=2 endif -- cgit From 1e249c41ea43157fc69e1496e3b4feea999f107a Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Tue, 31 Jan 2023 10:58:08 +0000 Subject: arm64: unify asm-arch manipulation Assemblers will reject instructions not supported by a target architecture version, and so we must explicitly tell the assembler the latest architecture version for which we want to assemble instructions from. We've added a few AS_HAS_ARMV8_ definitions for this, in addition to an inconsistently named AS_HAS_PAC definition, from which arm64's top-level Makefile determines the architecture version that we intend to target, and generates the `asm-arch` variable. To make this a bit clearer and easier to maintain, this patch reworks the Makefile to determine asm-arch in a single if-else-endif chain. AS_HAS_PAC, which is defined when the assembler supports `-march=armv8.3-a`, is renamed to AS_HAS_ARMV8_3. As the logic for armv8.3-a is lifted out of the block handling pointer authentication, `asm-arch` may now be set to armv8.3-a regardless of whether support for pointer authentication is selected. This means that it will be possible to assemble armv8.3-a instructions even if we didn't intend to, but this is consistent with our handling of other architecture versions, and the compiler won't generate armv8.3-a instructions regardless. For the moment there's no need for an CONFIG_AS_HAS_ARMV8_1, as the code for LSE atomics and LDAPR use individual `.arch_extension` entries and do not require the baseline asm arch to be bumped to armv8.1-a. The other armv8.1-a features (e.g. PAN) do not require assembler support. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland Reviewed-by: Ard Biesheuvel Reviewed-by: Mark Brown Cc: Will Deacon Link: https://lore.kernel.org/r/20230131105809.991288-2-mark.rutland@arm.com Signed-off-by: Catalin Marinas --- arch/arm64/Makefile | 37 ++++++++++++++++++------------------- 1 file changed, 18 insertions(+), 19 deletions(-) (limited to 'arch/arm64/Makefile') diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index d62bd221828f..e176eb76345b 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -63,11 +63,6 @@ stack_protector_prepare: prepare0 include/generated/asm-offsets.h)) endif -ifeq ($(CONFIG_AS_HAS_ARMV8_2), y) -# make sure to pass the newest target architecture to -march. -asm-arch := armv8.2-a -endif - # Ensure that if the compiler supports branch protection we default it # off, this will be overridden if we are using branch protection. branch-prot-flags-y += $(call cc-option,-mbranch-protection=none) @@ -88,25 +83,29 @@ branch-prot-flags-$(CONFIG_CC_HAS_BRANCH_PROT_PAC_RET_BTI) := -mbranch-protectio else branch-prot-flags-$(CONFIG_CC_HAS_BRANCH_PROT_PAC_RET) := -mbranch-protection=$(PACRET-y) endif -# -march=armv8.3-a enables the non-nops instructions for PAC, to avoid the -# compiler to generate them and consequently to break the single image contract -# we pass it only to the assembler. This option is utilized only in case of non -# integrated assemblers. -ifeq ($(CONFIG_AS_HAS_PAC), y) -asm-arch := armv8.3-a -endif endif KBUILD_CFLAGS += $(branch-prot-flags-y) -ifeq ($(CONFIG_AS_HAS_ARMV8_4), y) -# make sure to pass the newest target architecture to -march. -asm-arch := armv8.4-a -endif - +# Tell the assembler to support instructions from the latest target +# architecture. +# +# For non-integrated assemblers we'll pass this on the command line, and for +# integrated assemblers we'll define ARM64_ASM_ARCH and ARM64_ASM_PREAMBLE for +# inline usage. +# +# We cannot pass the same arch flag to the compiler as this would allow it to +# freely generate instructions which are not supported by earlier architecture +# versions, which would prevent a single kernel image from working on earlier +# hardware. ifeq ($(CONFIG_AS_HAS_ARMV8_5), y) -# make sure to pass the newest target architecture to -march. -asm-arch := armv8.5-a + asm-arch := armv8.5-a +else ifeq ($(CONFIG_AS_HAS_ARMV8_4), y) + asm-arch := armv8.4-a +else ifeq ($(CONFIG_AS_HAS_ARMV8_3), y) + asm-arch := armv8.3-a +else ifeq ($(CONFIG_AS_HAS_ARMV8_2), y) + asm-arch := armv8.2-a endif ifdef asm-arch -- cgit From c68cf5285e1896a2b725ec01a1351f08610165b8 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Tue, 31 Jan 2023 10:58:09 +0000 Subject: arm64: pauth: don't sign leaf functions Currently, when CONFIG_ARM64_PTR_AUTH_KERNEL=y (and CONFIG_UNWIND_PATCH_PAC_INTO_SCS=n), we enable pointer authentication for all functions, including leaf functions. This isn't necessary, and is unfortunate for a few reasons: * Any PACIASP instruction is implicitly a `BTI C` landing pad, and forcing the addition of a PACIASP in every function introduces a larger set of BTI gadgets than is necessary. * The PACIASP and AUTIASP instructions make leaf functions larger than necessary, bloating the kernel Image. For a defconfig v6.2-rc3 kernel, this appears to add ~64KiB relative to not signing leaf functions, which is unfortunate but not entirely onerous. * The PACIASP and AUTIASP instructions potentially make leaf functions more expensive in terms of performance and/or power. For many trivial leaf functions, this is clearly unnecessary, e.g. | : | d503233f paciasp | d53b4220 mrs x0, daif | d50323bf autiasp | d65f03c0 ret | : | d503233f paciasp | d50323bf autiasp | d65f03c0 ret | d503201f nop * When CONFIG_UNWIND_PATCH_PAC_INTO_SCS=y we disable pointer authentication for leaf functions, so clearly this is not functionally necessary, indicates we have an inconsistent threat model, and convolutes the Makefile logic. We've used pointer authentication in leaf functions since the introduction of in-kernel pointer authentication in commit: 74afda4016a7437e ("arm64: compile the kernel with ptrauth return address signing") ... but at the time we had no rationale for signing leaf functions. Subsequently, we considered avoiding signing leaf functions: https://lore.kernel.org/linux-arm-kernel/1586856741-26839-1-git-send-email-amit.kachhap@arm.com/ https://lore.kernel.org/linux-arm-kernel/1588149371-20310-1-git-send-email-amit.kachhap@arm.com/ ... however at the time we didn't have an abundance of reasons to avoid signing leaf functions as above (e.g. the BTI case), we had no hardware to make performance measurements, and it was reasoned that this gave some level of protection against a limited set of code-reuse gadgets which would fall through to a RET. We documented this in commit: 717b938e22f8dbf0 ("arm64: Document why we enable PAC support for leaf functions") Notably, this was before we supported any forward-edge CFI scheme (e.g. Arm BTI, or Clang CFI/kCFI), which would prevent jumping into the middle of a function. In addition, even with signing forced for leaf functions, AUTIASP may be placed before a number of instructions which might constitute such a gadget, e.g. | : | f9400022 ldr x2, [x1] | d503233f paciasp | d50323bf autiasp | f9408401 ldr x1, [x0, #264] | 720b005f tst w2, #0x200000 | b26b0022 orr x2, x1, #0x200000 | 926af821 and x1, x1, #0xffffffffffdfffff | 9a820021 csel x1, x1, x2, eq // eq = none | f9008401 str x1, [x0, #264] | d65f03c0 ret | : | 2a0003e3 mov w3, w0 | 9000ff42 adrp x2, ffff800009ffd000 | 9120e042 add x2, x2, #0x838 | 52800000 mov w0, #0x0 // #0 | d503233f paciasp | f000d041 adrp x1, ffff800009a20000 | d50323bf autiasp | 9102c021 add x1, x1, #0xb0 | f8635842 ldr x2, [x2, w3, uxtw #3] | f821685f str xzr, [x2, x1] | d65f03c0 ret | d503201f nop So generally, trying to use AUTIASP to detect such gadgetization is not robust, and this is dealt with far better by forward-edge CFI (which is designed to prevent such cases). We should bite the bullet and stop pretending that AUTIASP is a mitigation for such forward-edge gadgetization. For the above reasons, this patch has the kernel consistently sign non-leaf functions and avoid signing leaf functions. Considering a defconfig v6.2-rc3 kernel built with LLVM 15.0.6: * The vmlinux is ~43KiB smaller: | [mark@lakrids:~/src/linux]% ls -al vmlinux-* | -rwxr-xr-x 1 mark mark 338547808 Jan 25 17:17 vmlinux-after | -rwxr-xr-x 1 mark mark 338591472 Jan 25 17:22 vmlinux-before * The resulting Image is 64KiB smaller: | [mark@lakrids:~/src/linux]% ls -al Image-* | -rwxr-xr-x 1 mark mark 32702976 Jan 25 17:17 Image-after | -rwxr-xr-x 1 mark mark 32768512 Jan 25 17:22 Image-before * There are ~400 fewer BTI gadgets: | [mark@lakrids:~/src/linux]% usekorg 12.1.0 aarch64-linux-objdump -d vmlinux-before 2> /dev/null | grep -ow 'paciasp\|bti\sc\?' | sort | uniq -c | 1219 bti c | 61982 paciasp | [mark@lakrids:~/src/linux]% usekorg 12.1.0 aarch64-linux-objdump -d vmlinux-after 2> /dev/null | grep -ow 'paciasp\|bti\sc\?' | sort | uniq -c | 10099 bti c | 52699 paciasp Which is +8880 BTIs, and -9283 PACIASPs, for -403 unnecessary BTI gadgets. While this is small relative to the total, distinguishing the two cases will make it easier to analyse and reduce this set further in future. Signed-off-by: Mark Rutland Reviewed-by: Ard Biesheuvel Reviewed-by: Mark Brown Cc: Amit Daniel Kachhap Cc: Will Deacon Link: https://lore.kernel.org/r/20230131105809.991288-3-mark.rutland@arm.com Signed-off-by: Catalin Marinas --- arch/arm64/Makefile | 28 ++++++++-------------------- 1 file changed, 8 insertions(+), 20 deletions(-) (limited to 'arch/arm64/Makefile') diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index e176eb76345b..ab1f12b4f339 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -63,30 +63,18 @@ stack_protector_prepare: prepare0 include/generated/asm-offsets.h)) endif -# Ensure that if the compiler supports branch protection we default it -# off, this will be overridden if we are using branch protection. -branch-prot-flags-y += $(call cc-option,-mbranch-protection=none) - -ifeq ($(CONFIG_ARM64_PTR_AUTH_KERNEL),y) -branch-prot-flags-$(CONFIG_CC_HAS_SIGN_RETURN_ADDRESS) := -msign-return-address=all -# We enable additional protection for leaf functions as there is some -# narrow potential for ROP protection benefits and no substantial -# performance impact has been observed. -PACRET-y := pac-ret+leaf - -# Using a shadow call stack in leaf functions is too costly, so avoid PAC there -# as well when we may be patching PAC into SCS -PACRET-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) := pac-ret - ifeq ($(CONFIG_ARM64_BTI_KERNEL),y) -branch-prot-flags-$(CONFIG_CC_HAS_BRANCH_PROT_PAC_RET_BTI) := -mbranch-protection=$(PACRET-y)+bti + KBUILD_CFLAGS += -mbranch-protection=pac-ret+bti +else ifeq ($(CONFIG_ARM64_PTR_AUTH_KERNEL),y) + ifeq ($(CONFIG_CC_HAS_BRANCH_PROT_PAC_RET),y) + KBUILD_CFLAGS += -mbranch-protection=pac-ret + else + KBUILD_CFLAGS += -msign-return-address=non-leaf + endif else -branch-prot-flags-$(CONFIG_CC_HAS_BRANCH_PROT_PAC_RET) := -mbranch-protection=$(PACRET-y) -endif + KBUILD_CFLAGS += $(call cc-option,-mbranch-protection=none) endif -KBUILD_CFLAGS += $(branch-prot-flags-y) - # Tell the assembler to support instructions from the latest target # architecture. # -- cgit