Merge branch 'bpf: introduce bpf_get_branch_snapshot'

Song Liu says: ==================== Changes v6 => v7: 1. Improve/fix intel_pmu_snapshot_branch_stack() logic. (Peter). Changes v5 => v6: 1. Add local_irq_save/restore to intel_pmu_snapshot_branch_stack. (Peter) 2. Remove buf and size check in bpf_get_branch_snapshot, move flags check to later fo the function. (Peter, Andrii) 3. Revise comments for bpf_get_branch_snapshot in bpf.h (Andrii) Changes v4 => v5: 1. Modify perf_snapshot_branch_stack_t to save some memcpy. (Andrii) 2. Minor fixes in selftests. (Andrii) Changes v3 => v4: 1. Do not reshuffle intel_pmu_disable_all(). Use some inline to save LBR entries. (Peter) 2. Move static_call(perf_snapshot_branch_stack) to the helper. (Alexei) 3. Add argument flags to bpf_get_branch_snapshot. (Andrii) 4. Make MAX_BRANCH_SNAPSHOT an enum (Andrii). And rename it as PERF_MAX_BRANCH_SNAPSHOT 5. Make bpf_get_branch_snapshot similar to bpf_read_branch_records. (Andrii) 6. Move the test target function to bpf_testmod. Updated kallsyms_find_next to work properly with modules. (Andrii) Changes v2 => v3: 1. Fix the use of static_call. (Peter) 2. Limit the use to perfmon version >= 2. (Peter) 3. Modify intel_pmu_snapshot_branch_stack() to use intel_pmu_disable_all and intel_pmu_enable_all(). Changes v1 => v2: 1. Rename the helper as bpf_get_branch_snapshot; 2. Fix/simplify the use of static_call; 3. Instead of percpu variables, let intel_pmu_snapshot_branch_stack output branch records to an output argument of type perf_branch_snapshot. Branch stack can be very useful in understanding software events. For example, when a long function, e.g. sys_perf_event_open, returns an errno, it is not obvious why the function failed. Branch stack could provide very helpful information in this type of scenarios. This set adds support to read branch stack with a new BPF helper bpf_get_branch_trace(). Currently, this is only supported in Intel systems. It is also possible to support the same feaure for PowerPC. The hardware that records the branch stace is not stopped automatically on software events. Therefore, it is necessary to stop it in software soon. Otherwise, the hardware buffers/registers will be flushed. One of the key design consideration in this set is to minimize the number of branch record entries between the event triggers and the hardware recorder is stopped. Based on this goal, current design is different from the discussions in original RFC [1]: 1) Static call is used when supported, to save function pointer dereference; 2) intel_pmu_lbr_disable_all is used instead of perf_pmu_disable(), because the latter uses about 10 entries before stopping LBR. With current code, on Intel CPU, LBR is stopped after 7 branch entries after fexit triggers: ID: 0 from bpf_get_branch_snapshot+18 to intel_pmu_snapshot_branch_stack+0 ID: 1 from __brk_limit+477143934 to bpf_get_branch_snapshot+0 ID: 2 from __brk_limit+477192263 to __brk_limit+477143880 # trampoline ID: 3 from __bpf_prog_enter+34 to __brk_limit+477192251 ID: 4 from migrate_disable+60 to __bpf_prog_enter+9 ID: 5 from __bpf_prog_enter+4 to migrate_disable+0 ID: 6 from bpf_testmod_loop_test+20 to __bpf_prog_enter+0 ID: 7 from bpf_testmod_loop_test+20 to bpf_testmod_loop_test+13 ID: 8 from bpf_testmod_loop_test+20 to bpf_testmod_loop_test+13 ... [1] https://lore.kernel.org/bpf/20210818012937.2522409-1-songliubraving@fb.com/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
author: Alexei Starovoitov <ast@kernel.org> 2021-09-13 10:53:50 -0700
committer: Alexei Starovoitov <ast@kernel.org> 2021-09-13 10:53:51 -0700
commit: 14bef1ab30378467385623c7ff8fccad570c4a13 (patch)
tree: 2d03c43e8c339f5b95a134cfe802862b65d45d8b /tools
parent: 3384c7c7641b44987e35eadbc9df6c16a0520159 (diff)
parent: 025bd7c753aab18cd594924a46ab46ac47209df9 (diff)
10 files changed, 265 insertions, 52 deletions
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 51cfd91cc387..d21326558d42 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4877,6 +4877,27 @@ union bpf_attr {
  *		Get the struct pt_regs associated with **task**.
  *	Return
  *		A pointer to struct pt_regs.
+ *
+ * long bpf_get_branch_snapshot(void *entries, u32 size, u64 flags)
+ *	Description
+ *		Get branch trace from hardware engines like Intel LBR. The
+ *		hardware engine is stopped shortly after the helper is
+ *		called. Therefore, the user need to filter branch entries
+ *		based on the actual use case. To capture branch trace
+ *		before the trigger point of the BPF program, the helper
+ *		should be called at the beginning of the BPF program.
+ *
+ *		The data is stored as struct perf_branch_entry into output
+ *		buffer *entries*. *size* is the size of *entries* in bytes.
+ *		*flags* is reserved for now and must be zero.
+ *
+ *	Return
+ *		On success, number of bytes written to *buf*. On error, a
+ *		negative value.
+ *
+ *		**-EINVAL** if *flags* is not zero.
+ *
+ *		**-ENOENT** if architecture does not support branch records.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5055,6 +5076,7 @@ union bpf_attr {
 	FN(get_func_ip),		\
 	FN(get_attach_cookie),		\
 	FN(task_pt_regs),		\
+	FN(get_branch_snapshot),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 141d8da687d2..50fc5561110a 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -13,6 +13,18 @@
 
 DEFINE_PER_CPU(int, bpf_testmod_ksym_percpu) = 123;
 
+noinline int bpf_testmod_loop_test(int n)
+{
+	int i, sum = 0;
+
+	/* the primary goal of this test is to test LBR. Create a lot of
+	 * branches in the function, so we can catch it easily.
+	 */
+	for (i = 0; i < n; i++)
+		sum += i;
+	return sum;
+}
+
 noinline ssize_t
 bpf_testmod_test_read(struct file *file, struct kobject *kobj,
 		      struct bin_attribute *bin_attr,
@@ -24,7 +36,11 @@ bpf_testmod_test_read(struct file *file, struct kobject *kobj,
 		.len = len,
 	};
 
-	trace_bpf_testmod_test_read(current, &ctx);
+	/* This is always true. Use the check to make sure the compiler
+	 * doesn't remove bpf_testmod_loop_test.
+	 */
+	if (bpf_testmod_loop_test(101) > 100)
+		trace_bpf_testmod_test_read(current, &ctx);
 
 	return -EIO; /* always fail */
 }
@@ -71,4 +87,3 @@ module_exit(bpf_testmod_exit);
 MODULE_AUTHOR("Andrii Nakryiko");
 MODULE_DESCRIPTION("BPF selftests module");
 MODULE_LICENSE("Dual BSD/GPL");
-
diff --git a/tools/testing/selftests/bpf/prog_tests/core_reloc.c b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
index 4739b15b2a97..15d355af8d1d 100644
--- a/tools/testing/selftests/bpf/prog_tests/core_reloc.c
+++ b/tools/testing/selftests/bpf/prog_tests/core_reloc.c
@@ -30,7 +30,7 @@ static int duration = 0;
 	.output_len = sizeof(struct core_reloc_module_output),		\
 	.prog_sec_name = sec_name,					\
 	.raw_tp_name = tp_name,						\
-	.trigger = trigger_module_test_read,				\
+	.trigger = __trigger_module_test_read,				\
 	.needs_testmod = true,						\
 }
 
@@ -475,19 +475,11 @@ static int setup_type_id_case_failure(struct core_reloc_test_case *test)
 	return 0;
 }
 
-static int trigger_module_test_read(const struct core_reloc_test_case *test)
+static int __trigger_module_test_read(const struct core_reloc_test_case *test)
 {
 	struct core_reloc_module_output *exp = (void *)test->output;
-	int fd, err;
-
-	fd = open("/sys/kernel/bpf_testmod", O_RDONLY);
-	err = -errno;
-	if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err))
-		return err;
-
-	read(fd, NULL, exp->len); /* request expected number of bytes */
-	close(fd);
 
+	trigger_module_test_read(exp->len);
 	return 0;
 }
 
diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
new file mode 100644
index 000000000000..f81db9135ae4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include <test_progs.h>
+#include "get_branch_snapshot.skel.h"
+
+static int *pfd_array;
+static int cpu_cnt;
+
+static int create_perf_events(void)
+{
+	struct perf_event_attr attr = {0};
+	int cpu;
+
+	/* create perf event */
+	attr.size = sizeof(attr);
+	attr.type = PERF_TYPE_RAW;
+	attr.config = 0x1b00;
+	attr.sample_type = PERF_SAMPLE_BRANCH_STACK;
+	attr.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
+		PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_ANY;
+
+	cpu_cnt = libbpf_num_possible_cpus();
+	pfd_array = malloc(sizeof(int) * cpu_cnt);
+	if (!pfd_array) {
+		cpu_cnt = 0;
+		return 1;
+	}
+
+	for (cpu = 0; cpu < cpu_cnt; cpu++) {
+		pfd_array[cpu] = syscall(__NR_perf_event_open, &attr,
+					 -1, cpu, -1, PERF_FLAG_FD_CLOEXEC);
+		if (pfd_array[cpu] < 0)
+			break;
+	}
+
+	return cpu == 0;
+}
+
+static void close_perf_events(void)
+{
+	int cpu = 0;
+	int fd;
+
+	while (cpu++ < cpu_cnt) {
+		fd = pfd_array[cpu];
+		if (fd < 0)
+			break;
+		close(fd);
+	}
+	free(pfd_array);
+}
+
+void test_get_branch_snapshot(void)
+{
+	struct get_branch_snapshot *skel = NULL;
+	int err;
+
+	if (create_perf_events()) {
+		test__skip();  /* system doesn't support LBR */
+		goto cleanup;
+	}
+
+	skel = get_branch_snapshot__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "get_branch_snapshot__open_and_load"))
+		goto cleanup;
+
+	err = kallsyms_find("bpf_testmod_loop_test", &skel->bss->address_low);
+	if (!ASSERT_OK(err, "kallsyms_find"))
+		goto cleanup;
+
+	err = kallsyms_find_next("bpf_testmod_loop_test", &skel->bss->address_high);
+	if (!ASSERT_OK(err, "kallsyms_find_next"))
+		goto cleanup;
+
+	err = get_branch_snapshot__attach(skel);
+	if (!ASSERT_OK(err, "get_branch_snapshot__attach"))
+		goto cleanup;
+
+	trigger_module_test_read(100);
+
+	if (skel->bss->total_entries < 16) {
+		/* too few entries for the hit/waste test */
+		test__skip();
+		goto cleanup;
+	}
+
+	ASSERT_GT(skel->bss->test1_hits, 6, "find_looptest_in_lbr");
+
+	/* Given we stop LBR in software, we will waste a few entries.
+	 * But we should try to waste as few as possible entries. We are at
+	 * about 7 on x86_64 systems.
+	 * Add a check for < 10 so that we get heads-up when something
+	 * changes and wastes too many entries.
+	 */
+	ASSERT_LT(skel->bss->wasted_entries, 10, "check_wasted_entries");
+
+cleanup:
+	get_branch_snapshot__destroy(skel);
+	close_perf_events();
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/module_attach.c b/tools/testing/selftests/bpf/prog_tests/module_attach.c
index d85a69b7ce44..1797a6e4d6d8 100644
--- a/tools/testing/selftests/bpf/prog_tests/module_attach.c
+++ b/tools/testing/selftests/bpf/prog_tests/module_attach.c
@@ -6,45 +6,6 @@
 
 static int duration;
 
-static int trigger_module_test_read(int read_sz)
-{
-	int fd, err;
-
-	fd = open("/sys/kernel/bpf_testmod", O_RDONLY);
-	err = -errno;
-	if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err))
-		return err;
-
-	read(fd, NULL, read_sz);
-	close(fd);
-
-	return 0;
-}
-
-static int trigger_module_test_write(int write_sz)
-{
-	int fd, err;
-	char *buf = malloc(write_sz);
-
-	if (!buf)
-		return -ENOMEM;
-
-	memset(buf, 'a', write_sz);
-	buf[write_sz-1] = '\0';
-
-	fd = open("/sys/kernel/bpf_testmod", O_WRONLY);
-	err = -errno;
-	if (CHECK(fd < 0, "testmod_file_open", "failed: %d\n", err)) {
-		free(buf);
-		return err;
-	}
-
-	write(fd, buf, write_sz);
-	close(fd);
-	free(buf);
-	return 0;
-}
-
 static int delete_module(const char *name, int flags)
 {
 	return syscall(__NR_delete_module, name, flags);
diff --git a/tools/testing/selftests/bpf/progs/get_branch_snapshot.c b/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
new file mode 100644
index 000000000000..a1b139888048
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2021 Facebook */
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+__u64 test1_hits = 0;
+__u64 address_low = 0;
+__u64 address_high = 0;
+int wasted_entries = 0;
+long total_entries = 0;
+
+#define ENTRY_CNT 32
+struct perf_branch_entry entries[ENTRY_CNT] = {};
+
+static inline bool in_range(__u64 val)
+{
+	return (val >= address_low) && (val < address_high);
+}
+
+SEC("fexit/bpf_testmod_loop_test")
+int BPF_PROG(test1, int n, int ret)
+{
+	long i;
+
+	total_entries = bpf_get_branch_snapshot(entries, sizeof(entries), 0);
+	total_entries /= sizeof(struct perf_branch_entry);
+
+	for (i = 0; i < ENTRY_CNT; i++) {
+		if (i >= total_entries)
+			break;
+		if (in_range(entries[i].from) && in_range(entries[i].to))
+			test1_hits++;
+		else if (!test1_hits)
+			wasted_entries++;
+	}
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index cc1cd240445d..2ed01f615d20 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -743,6 +743,45 @@ int cd_flavor_subdir(const char *exec_name)
 	return chdir(flavor);
 }
 
+int trigger_module_test_read(int read_sz)
+{
+	int fd, err;
+
+	fd = open("/sys/kernel/bpf_testmod", O_RDONLY);
+	err = -errno;
+	if (!ASSERT_GE(fd, 0, "testmod_file_open"))
+		return err;
+
+	read(fd, NULL, read_sz);
+	close(fd);
+
+	return 0;
+}
+
+int trigger_module_test_write(int write_sz)
+{
+	int fd, err;
+	char *buf = malloc(write_sz);
+
+	if (!buf)
+		return -ENOMEM;
+
+	memset(buf, 'a', write_sz);
+	buf[write_sz-1] = '\0';
+
+	fd = open("/sys/kernel/bpf_testmod", O_WRONLY);
+	err = -errno;
+	if (!ASSERT_GE(fd, 0, "testmod_file_open")) {
+		free(buf);
+		return err;
+	}
+
+	write(fd, buf, write_sz);
+	close(fd);
+	free(buf);
+	return 0;
+}
+
 #define MAX_BACKTRACE_SZ 128
 void crash_handler(int signum)
 {
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index c8c2bf878f67..94bef0aa74cf 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -291,6 +291,8 @@ int compare_map_keys(int map1_fd, int map2_fd);
 int compare_stack_ips(int smap_fd, int amap_fd, int stack_trace_len);
 int extract_build_id(char *build_id, size_t size);
 int kern_sync_rcu(void);
+int trigger_module_test_read(int read_sz);
+int trigger_module_test_write(int write_sz);
 
 #ifdef __x86_64__
 #define SYS_NANOSLEEP_KPROBE_NAME "__x64_sys_nanosleep"
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index e7a19b04d4ea..5100a169b72b 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -1,4 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <ctype.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -117,6 +118,42 @@ out:
 	return err;
 }
 
+/* find the address of the next symbol of the same type, this can be used
+ * to determine the end of a function.
+ */
+int kallsyms_find_next(const char *sym, unsigned long long *addr)
+{
+	char type, found_type, name[500];
+	unsigned long long value;
+	bool found = false;
+	int err = 0;
+	FILE *f;
+
+	f = fopen("/proc/kallsyms", "r");
+	if (!f)
+		return -EINVAL;
+
+	while (fscanf(f, "%llx %c %499s%*[^\n]\n", &value, &type, name) > 0) {
+		/* Different types of symbols in kernel modules are mixed
+		 * in /proc/kallsyms. Only return the next matching type.
+		 * Use tolower() for type so that 'T' matches 't'.
+		 */
+		if (found && found_type == tolower(type)) {
+			*addr = value;
+			goto out;
+		}
+		if (strcmp(name, sym) == 0) {
+			found = true;
+			found_type = tolower(type);
+		}
+	}
+	err = -ENOENT;
+
+out:
+	fclose(f);
+	return err;
+}
+
 void read_trace_pipe(void)
 {
 	int trace_fd;
diff --git a/tools/testing/selftests/bpf/trace_helpers.h b/tools/testing/selftests/bpf/trace_helpers.h
index d907b445524d..bc8ed86105d9 100644
--- a/tools/testing/selftests/bpf/trace_helpers.h
+++ b/tools/testing/selftests/bpf/trace_helpers.h
@@ -16,6 +16,11 @@ long ksym_get_addr(const char *name);
 /* open kallsyms and find addresses on the fly, faster than load + search. */
 int kallsyms_find(const char *sym, unsigned long long *addr);
 
+/* find the address of the next symbol, this can be used to determine the
+ * end of a function
+ */
+int kallsyms_find_next(const char *sym, unsigned long long *addr);
+
 void read_trace_pipe(void);
 
 ssize_t get_uprobe_offset(const void *addr, ssize_t base);
author	Alexei Starovoitov <ast@kernel.org>	2021-09-13 10:53:50 -0700
committer	Alexei Starovoitov <ast@kernel.org>	2021-09-13 10:53:51 -0700
commit	14bef1ab30378467385623c7ff8fccad570c4a13 (patch)
tree	2d03c43e8c339f5b95a134cfe802862b65d45d8b /tools
parent	3384c7c7641b44987e35eadbc9df6c16a0520159 (diff)
parent	025bd7c753aab18cd594924a46ab46ac47209df9 (diff)