summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-09-15bpf: make it easier to add new metadata kfuncStanislav Fomichev
No functional changes. Instead of having hand-crafted code in bpf_dev_bound_resolve_kfunc, move kfunc <> xmo handler relationship into XDP_METADATA_KFUNC_xxx. This way, any time new kfunc is added, we don't have to touch bpf_dev_bound_resolve_kfunc. Also document XDP_METADATA_KFUNC_xxx arguments since we now have more than two and it might be confusing what is what. Cc: netdev@vger.kernel.org Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230913171350.369987-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-15xsk: add multi-buffer support for sockets sharing umemTirthendu Sarkar
Userspace applications indicate their multi-buffer capability to xsk using XSK_USE_SG socket bind flag. For sockets using shared umem the bind flag may contain XSK_USE_SG only for the first socket. For any subsequent socket the only option supported is XDP_SHARED_UMEM. Add option XDP_UMEM_SG_FLAG in umem config flags to store the multi-buffer handling capability when indicated by XSK_USE_SG option in bing flag by the first socket. Use this to derive multi-buffer capability for subsequent sockets in xsk core. Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com> Fixes: 81470b5c3c66 ("xsk: introduce XSK_USE_SG bind flag for xsk socket") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230907035032.2627879-1-tirthendu.sarkar@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14bpf: Charge modmem for struct_ops trampolineSong Liu
Current code charges modmem for regular trampoline, but not for struct_ops trampoline. Add bpf_jit_[charge|uncharge]_modmem() to struct_ops so the trampoline is charged in both cases. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20230914222542.2986059-1-song@kernel.org Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-14selftests/bpf: Skip module_fentry_shadow test when bpf_testmod is not availableArtem Savkov
This test relies on bpf_testmod, so skip it if the module is not available. Fixes: aa3d65de4b900 ("bpf/selftests: Test fentry attachment to shadowed functions") Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230914124928.340701-1-asavkov@redhat.com
2023-09-14Merge branch 'seltests-xsk-various-improvements-to-xskxceiver'Alexei Starovoitov
Magnus Karlsson says: ==================== seltests/xsk: various improvements to xskxceiver This patch set implements several improvements to the xsk selftests test suite that I thought were useful while debugging the xsk multi-buffer code and tests. The largest new feature is the ability to be able to execute a single test instead of the whole test suite. This required some surgery on the current code, details below. Anatomy of the path set: 1: Print useful info on a per packet basis with the option -v 2: Add a timeout in the transmission loop too. We only used to have one for the Rx thread, but Tx can lock up too waiting for completions. 3: Add an option (-m) to only run the tests (or a single test with a later patch) in a single mode: skb, drv, or zc (zero-copy). 4-5: Preparatory patches to be able to specify a test to run. Need to define the test names in a single structure and their entry points, so we can use this when wanting to run a specific test. 6: Adds a command line option (-l) that lists all the tests. 7: Adds a command line option (-t) that runs a specific test instead of the whole test suite. Can be combined with -m to specify a single mode too. 8: Use ksft_print_msg() uniformly throughout the tests. It was a mix of printf() and ksft_print_msg() before. 9: In some places, we failed the whole test suite instead of a single test in certain circumstances. Fix this so only the test in question is failed and the rest of the test suite continues. 10: Display the available command line options with -h v3 -> v4: * Fixed another spelling error in patch #9 [Maciej] * Only allow the actual strings for the -m command [Maciej] * Move some code from patch #7 to #3 [Maciej] v2 -> v3: * Drop the support for environment variables. Probably not useful. [Maciej] * Fixed spelling mistake in patch #9 [Maciej] * Fail gracefully if unsupported mode is chosen [Maciej] * Simplified test run loop [Maciej] v1 -> v2: * Introduce XSKTEST_MODE env variable to be able to set the mode to use [Przemyslaw] * Introduce XSKTEST_ETH env variable to be able to set the ethernet interface to use by introducing a new patch (#11) [Magnus] * Fixed spelling error in patch #5 [Przemyslaw, Maciej] * Fixed confusing documentation in patch #10 [Przemyslaw] * The -l option can now be used without being root [Magnus, Maciej] * Fixed documentation error in patch #7 [Maciej] * Added error handling to the -t option [Maciej] * -h now displayed as an option [Maciej] Thanks: Magnus ==================== Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-1-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: display command line options with -hMagnus Karlsson
Add the -h option to display all available command line options available for test_xsk.sh and xskxceiver. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-11-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: fail single test instead of all testsMagnus Karlsson
In a number of places at en error, exit_with_error() is called that terminates the whole test suite. This is not always desirable as it would be more logical to only fail that test and then go along with the other ones. So change this in a number of places in which I thought it would be more logical to just fail the test in question. Examples of this are in code that is only used by a single test. Also delete a pointless if-statement in receive_pkts() that has an exit_with_error() in it. It can never occur since the return value is an unsigned and the test is for less than zero. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-10-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: use ksft_print_msg uniformlyMagnus Karlsson
Use ksft_print_msg() instead of printf() and fprintf() in all places as the ksefltests framework is being used. There is only one exception and that is for the list-of-tests print out option, since no tests are run in that case. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-9-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: add option to run single testMagnus Karlsson
Add a command line option to be able to run a single test. This option (-t) takes a number from the list of tests available with the "-l" option. Here are two examples: Run test number 2, the "receive single packet" test in all available modes: ./test_xsk.sh -t 2 Run test number 21, the metadata copy test in skb mode only ./test_xsh.sh -t 21 -m skb Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-8-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: add option that lists all testsMagnus Karlsson
Add a command line option (-l) that lists all the tests. The number before the test will be used in the next commit for specifying a single test to run. Here is an example of the output: Tests: 0: SEND_RECEIVE 1: SEND_RECEIVE_2K_FRAME 2: SEND_RECEIVE_SINGLE_PKT 3: POLL_RX 4: POLL_TX 5: POLL_RXQ_FULL 6: POLL_TXQ_FULL 7: SEND_RECEIVE_UNALIGNED : : Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-7-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: declare test names in structMagnus Karlsson
Declare the test names statically in a struct so that we can refer to them when adding the support to execute a single test in the next commit. Before this patch, the names of them were not declared in a single place which made it not possible to refer to them. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-6-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: move all tests to separate functionsMagnus Karlsson
Prepare for the capability to be able to run a single test by moving all the tests to their own functions. This function can then be called to execute that test in the next commit. Also, the tests named RUN_TO_COMPLETION_* were not named well, so change them to SEND_RECEIVE_* as it is just a basic send and receive test of 4K packets. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-5-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: add option to only run tests in a single modeMagnus Karlsson
Add an option -m on the command line that allows the user to run the tests in a single mode instead of all of them. Valid modes are skb, drv, and zc (zero-copy). An example: To run test suite in drv mode only: ./test_xsk.sh -m drv Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-4-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: add timeout for Tx threadMagnus Karlsson
Add a timeout for the transmission thread. If packets are not completed properly, for some reason, the test harness would previously get stuck forever in a while loop. But with this patch, this timeout will trigger, flag the test as a failure, and continue with the next test. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-3-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-14selftests/xsk: print per packet info in verbose modeMagnus Karlsson
Print info about every packet in verbose mode, both for Tx and Rx. This is useful to have when a test fails or to validate that a test is really doing what it was designed to do. Info on what is supposed to be received and sent is also printed for the custom packet streams since they differ from the base line. Here is an example: Tx addr: 37e0 len: 64 options: 0 pkt_nb: 8 Tx addr: 4000 len: 64 options: 0 pkt_nb: 9 Rx: addr: 100 len: 64 options: 0 pkt_nb: 0 valid: 1 Rx: addr: 1100 len: 64 options: 0 pkt_nb: 1 valid: 1 Rx: addr: 2100 len: 64 options: 0 pkt_nb: 4 valid: 1 Rx: addr: 3100 len: 64 options: 0 pkt_nb: 8 valid: 1 Rx: addr: 4100 len: 64 options: 0 pkt_nb: 9 valid: 1 One pointless verbose print statement is also deleted and another one is made clearer. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230914084900.492-2-magnus.karlsson@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-13docs/bpf: update out-of-date doc in BPF flow dissectorQuan Tian
Commit a5e2151ff9d5 ("net/ipv6: SKB symmetric hash should incorporate transport ports") removed the use of FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL in __skb_get_hash_symmetric(), making the doc out-of-date. Signed-off-by: Quan Tian <qtian@vmware.com> Link: https://lore.kernel.org/r/20230911152353.8280-1-qtian@vmware.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-12Merge branch 'bpf-x64-fix-tailcall-infinite-loop'Alexei Starovoitov
Leon Hwang says: ==================== bpf, x64: Fix tailcall infinite loop This patch series fixes a tailcall infinite loop on x64. From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT"), the tailcall on x64 works better than before. From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms for x64 JIT"), tailcall is able to run in BPF subprograms on x64. From commit 5b92a28aae4dd0f8 ("bpf: Support attaching tracing BPF program to other BPF programs"), BPF program is able to trace other BPF programs. How about combining them all together? 1. FENTRY/FEXIT on a BPF subprogram. 2. A tailcall runs in the BPF subprogram. 3. The tailcall calls the subprogram's caller. As a result, a tailcall infinite loop comes up. And the loop would halt the machine. As we know, in tail call context, the tail_call_cnt propagates by stack and rax register between BPF subprograms. So do in trampolines. How did I discover the bug? From commit 7f6e4312e15a5c37 ("bpf: Limit caller's stack depth 256 for subprogs with tailcalls"), the total stack size limits to around 8KiB. Then, I write some bpf progs to validate the stack consuming, that are tailcalls running in bpf2bpf and FENTRY/FEXIT tracing on bpf2bpf. At that time, accidently, I made a tailcall loop. And then the loop halted my VM. Without the loop, the bpf progs would consume over 8KiB stack size. But the _stack-overflow_ did not halt my VM. With bpf_printk(), I confirmed that the tailcall count limit did not work expectedly. Next, read the code and fix it. Thank Ilya Leoshkevich, this bug on s390x has been fixed. Hopefully, this bug on arm64 will be fixed in near future. ==================== Link: https://lore.kernel.org/r/20230912150442.2009-1-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12selftests/bpf: Add testcases for tailcall infinite loop fixingLeon Hwang
Add 4 test cases to confirm the tailcall infinite loop bug has been fixed. Like tailcall_bpf2bpf cases, do fentry/fexit on the bpf2bpf, and then check the final count result. tools/testing/selftests/bpf/test_progs -t tailcalls 226/13 tailcalls/tailcall_bpf2bpf_fentry:OK 226/14 tailcalls/tailcall_bpf2bpf_fexit:OK 226/15 tailcalls/tailcall_bpf2bpf_fentry_fexit:OK 226/16 tailcalls/tailcall_bpf2bpf_fentry_entry:OK 226 tailcalls:OK Summary: 1/16 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230912150442.2009-4-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12bpf, x64: Fix tailcall infinite loopLeon Hwang
From commit ebf7d1f508a73871 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT"), the tailcall on x64 works better than before. From commit e411901c0b775a3a ("bpf: allow for tailcalls in BPF subprograms for x64 JIT"), tailcall is able to run in BPF subprograms on x64. From commit 5b92a28aae4dd0f8 ("bpf: Support attaching tracing BPF program to other BPF programs"), BPF program is able to trace other BPF programs. How about combining them all together? 1. FENTRY/FEXIT on a BPF subprogram. 2. A tailcall runs in the BPF subprogram. 3. The tailcall calls the subprogram's caller. As a result, a tailcall infinite loop comes up. And the loop would halt the machine. As we know, in tail call context, the tail_call_cnt propagates by stack and rax register between BPF subprograms. So do in trampolines. Fixes: ebf7d1f508a7 ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Fixes: e411901c0b77 ("bpf: allow for tailcalls in BPF subprograms for x64 JIT") Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230912150442.2009-3-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-12bpf, x64: Comment tail_call_cnt initialisationLeon Hwang
Without understanding emit_prologue(), it is really hard to figure out where does tail_call_cnt come from, even though searching tail_call_cnt in the whole kernel repo. By adding these comments, it is a little bit easier to understand tail_call_cnt initialisation. Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Link: https://lore.kernel.org/r/20230912150442.2009-2-hffilwlqm@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-11selftests/bpf: Correct map_fd to data_fd in tailcallsLeon Hwang
Get and check data_fd. It should not check map_fd again. Meanwhile, correct some 'return' to 'goto out'. Thank the suggestion from Maciej in "bpf, x64: Fix tailcall infinite loop"[0] discussions. [0] https://lore.kernel.org/bpf/e496aef8-1f80-0f8e-dcdd-25a8c300319a@gmail.com/T/#m7d3b601066ba66400d436b7e7579b2df4a101033 Fixes: 79d49ba048ec ("bpf, testing: Add various tail call test cases") Fixes: 3b0379111197 ("selftests/bpf: Add tailcall_bpf2bpf tests") Fixes: 5e0b0a4c52d3 ("selftests/bpf: Test tail call counting with bpf2bpf and data on stack") Signed-off-by: Leon Hwang <hffilwlqm@gmail.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/r/20230906154256.95461-1-hffilwlqm@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-09-08bpftool: Fix -Wcast-qual warningDenys Zagorui
This cast was made by purpose for older libbpf where the bpf_object_skeleton field is void * instead of const void * to eliminate a warning (as i understand -Wincompatible-pointer-types-discards-qualifiers) but this cast introduces another warning (-Wcast-qual) for libbpf where data field is const void * It makes sense for bpftool to be in sync with libbpf from kernel sources Signed-off-by: Denys Zagorui <dzagorui@cisco.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20230907090210.968612-1-dzagorui@cisco.com
2023-09-08Merge branch 'selftests/bpf: Optimize kallsyms cache'Andrii Nakryiko
Rong Tao says: ==================== We need to optimize the kallsyms cache, including optimizations for the number of symbols limit, and, some test cases add new kernel symbols (such as testmods) and we need to refresh kallsyms (reload or refresh). ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2023-09-08selftests/bpf: trace_helpers.c: Add a global ksyms initialization mutexRong Tao
As Jirka said [0], we just need to make sure that global ksyms initialization won't race. [0] https://lore.kernel.org/lkml/ZPCbAs3ItjRd8XVh@krava/ Signed-off-by: Rong Tao <rongtao@cestc.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/tencent_5D0A837E219E2CFDCB0495DAD7D5D1204407@qq.com
2023-09-08selftests/bpf: trace_helpers.c: Optimize kallsyms cacheRong Tao
Static ksyms often have problems because the number of symbols exceeds the MAX_SYMS limit. Like changing the MAX_SYMS from 300000 to 400000 in commit e76a014334a6("selftests/bpf: Bump and validate MAX_SYMS") solves the problem somewhat, but it's not the perfect way. This commit uses dynamic memory allocation, which completely solves the problem caused by the limitation of the number of kallsyms. At the same time, add APIs: load_kallsyms_local() ksym_search_local() ksym_get_addr_local() free_kallsyms_local() There are used to solve the problem of selftests/bpf updating kallsyms after attach new symbols during testmod testing. Signed-off-by: Rong Tao <rongtao@cestc.cn> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/tencent_C9BDA68F9221F21BE4081566A55D66A9700A@qq.com
2023-09-08Merge branch 'bpf-task_group_seq_get_next-misc-cleanups'Alexei Starovoitov
Oleg Nesterov says: ==================== bpf: task_group_seq_get_next: misc cleanups Yonghong, I am resending 1-5 of 6 as you suggested with your acks included. The next (final) patch will change this code to use __next_thread when https://lore.kernel.org/all/20230824143142.GA31222@redhat.com/ is merged. Oleg. ==================== Link: https://lore.kernel.org/r/20230905154612.GA24872@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08Merge branch 'bpf-enable-irq-after-irq_work_raise-completes'Alexei Starovoitov
Hou Tao says: ==================== bpf: Enable IRQ after irq_work_raise() completes From: Hou Tao <houtao1@huawei.com> Hi, The patchset aims to fix the problem that bpf_mem_alloc() may return NULL unexpectedly when multiple bpf_mem_alloc() are invoked concurrently under process context and there is still free memory available. The problem was found when doing stress test for qp-trie but the same problem also exists for bpf_obj_new() as demonstrated in patch #3. As pointed out by Alexei, the patchset can only fix ENOMEM problem for normal process context and can not fix the problem for irq-disabled context or RT-enabled kernel. Patch #1 fixes the race between unit_alloc() and unit_alloc(). Patch #2 fixes the race between unit_alloc() and unit_free(). And patch #3 adds a selftest for the problem. The major change compared with v1 is using local_irq_{save,restore)() pair to disable and enable preemption instead of preempt_{disable,enable}_notrace pair. The main reason is to prevent potential overhead from __preempt_schedule_notrace(). I also run htab_mem benchmark and hash_map_perf on a 8-CPUs KVM VM to compare the performance between local_irq_{save,restore} and preempt_{disable,enable}_notrace(), but the results are similar as shown below: (1) use preempt_{disable,enable}_notrace() [root@hello bpf]# ./map_perf_test 4 8 0:hash_map_perf kmalloc 652179 events per sec 1:hash_map_perf kmalloc 651880 events per sec 2:hash_map_perf kmalloc 651382 events per sec 3:hash_map_perf kmalloc 650791 events per sec 5:hash_map_perf kmalloc 650140 events per sec 6:hash_map_perf kmalloc 652773 events per sec 7:hash_map_perf kmalloc 652751 events per sec 4:hash_map_perf kmalloc 648199 events per sec [root@hello bpf]# ./benchs/run_bench_htab_mem.sh normal bpf ma ============= overwrite per-prod-op: 110.82 ± 0.02k/s, avg mem: 2.00 ± 0.00MiB, peak mem: 2.73MiB batch_add_batch_del per-prod-op: 89.79 ± 0.75k/s, avg mem: 1.68 ± 0.38MiB, peak mem: 2.73MiB add_del_on_diff_cpu per-prod-op: 17.83 ± 0.07k/s, avg mem: 25.68 ± 2.92MiB, peak mem: 35.10MiB (2) use local_irq_{save,restore} [root@hello bpf]# ./map_perf_test 4 8 0:hash_map_perf kmalloc 656299 events per sec 1:hash_map_perf kmalloc 656397 events per sec 2:hash_map_perf kmalloc 656046 events per sec 3:hash_map_perf kmalloc 655723 events per sec 5:hash_map_perf kmalloc 655221 events per sec 4:hash_map_perf kmalloc 654617 events per sec 6:hash_map_perf kmalloc 650269 events per sec 7:hash_map_perf kmalloc 653665 events per sec [root@hello bpf]# ./benchs/run_bench_htab_mem.sh normal bpf ma ============= overwrite per-prod-op: 116.10 ± 0.02k/s, avg mem: 2.00 ± 0.00MiB, peak mem: 2.74MiB batch_add_batch_del per-prod-op: 88.76 ± 0.61k/s, avg mem: 1.94 ± 0.33MiB, peak mem: 2.74MiB add_del_on_diff_cpu per-prod-op: 18.12 ± 0.08k/s, avg mem: 25.10 ± 2.70MiB, peak mem: 34.78MiB As ususal comments are always welcome. Change Log: v2: * Use local_irq_save to disable preemption instead of using preempt_{disable,enable}_notrace pair to prevent potential overhead v1: https://lore.kernel.org/bpf/20230822133807.3198625-1-houtao@huaweicloud.com/ ==================== Link: https://lore.kernel.org/r/20230901111954.1804721-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: simplify the "next tid" logicOleg Nesterov
Kill saved_tid. It looks ugly to update *tid and then restore the previous value if __task_pid_nr_ns() returns 0. Change this code to update *tid and common->pid_visiting once before return. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154656.GA24950@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Test preemption between bpf_obj_new() and bpf_obj_drop()Hou Tao
The test case creates 4 threads and then pins these 4 threads in CPU 0. These 4 threads will run different bpf program through bpf_prog_test_run_opts() and these bpf program will use bpf_obj_new() and bpf_obj_drop() to allocate and free local kptrs concurrently. Under preemptible kernel, bpf_obj_new() and bpf_obj_drop() may preempt each other, bpf_obj_new() may return NULL and the test will fail before applying these fixes as shown below: test_preempted_bpf_ma_op:PASS:open_and_load 0 nsec test_preempted_bpf_ma_op:PASS:attach 0 nsec test_preempted_bpf_ma_op:PASS:no test prog 0 nsec test_preempted_bpf_ma_op:PASS:no test prog 0 nsec test_preempted_bpf_ma_op:PASS:no test prog 0 nsec test_preempted_bpf_ma_op:PASS:no test prog 0 nsec test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec test_preempted_bpf_ma_op:PASS:pthread_create 0 nsec test_preempted_bpf_ma_op:PASS:run prog err 0 nsec test_preempted_bpf_ma_op:PASS:run prog err 0 nsec test_preempted_bpf_ma_op:PASS:run prog err 0 nsec test_preempted_bpf_ma_op:PASS:run prog err 0 nsec test_preempted_bpf_ma_op:FAIL:ENOMEM unexpected ENOMEM: got TRUE #168 preempted_bpf_ma_op:FAIL Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: kill next_taskOleg Nesterov
It only adds the unnecessary confusion and compicates the "retry" code. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154654.GA24945@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Enable IRQ after irq_work_raise() completes in unit_free{_rcu}()Hou Tao
Both unit_free() and unit_free_rcu() invoke irq_work_raise() to free freed objects back to slab and the invocation may also be preempted by unit_alloc() and unit_alloc() may return NULL unexpectedly as shown in the following case: task A task B unit_free() // high_watermark = 48 // free_cnt = 49 after free irq_work_raise() // mark irq work as IRQ_WORK_PENDING irq_work_claim() // task B preempts task A unit_alloc() // free_cnt = 48 after alloc // does unit_alloc() 32-times ...... // free_cnt = 16 unit_alloc() // free_cnt = 15 after alloc // irq work is already PENDING, // so just return irq_work_raise() // does unit_alloc() 15-times ...... // free_cnt = 0 unit_alloc() // free_cnt = 0 before alloc return NULL Fix it by enabling IRQ after irq_work_raise() completes. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: fix the skip_if_dup_files checkOleg Nesterov
Unless I am notally confused it is wrong. We are going to return or skip next_task so we need to check next_task-files, not task->files. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154651.GA24940@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: cleanup the usage of get/put_task_structOleg Nesterov
get_pid_task() makes no sense, the code does put_task_struct() soon after. Use find_task_by_pid_ns() instead of find_pid_ns + get_pid_task and kill put_task_struct(), this allows to do get_task_struct() only once before return. While at it, kill the unnecessary "if (!pid)" check in the "if (!*tid)" block, this matches the next usage of find_pid_ns() + get_pid_task() in this function. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154649.GA24935@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: task_group_seq_get_next: cleanup the usage of next_thread()Oleg Nesterov
1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we can safely iterate the task->thread_group list. Even if this task exits right after get_pid_task() (or goto retry) and pid_alive() returns 0. Kill the unnecessary pid_alive() check. 2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)" check. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230905154646.GA24928@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08Merge branch 'bpf-add-support-for-local-percpu-kptr'Alexei Starovoitov
Yonghong Song says: ==================== bpf: Add support for local percpu kptr Patch set [1] implemented cgroup local storage BPF_MAP_TYPE_CGRP_STORAGE similar to sk/task/inode local storage and old BPF_MAP_TYPE_CGROUP_STORAGE map is marked as deprecated since old BPF_MAP_TYPE_CGROUP_STORAGE map can only work with current cgroup. Similarly, the existing BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE map is a percpu version of BPF_MAP_TYPE_CGROUP_STORAGE and only works with current cgroup. But there is no replacement which can work with arbitrary cgroup. This patch set solved this problem but adding support for local percpu kptr. The map value can have a percpu kptr field which holds a bpf prog allocated percpu data. The below is an example, struct percpu_val_t { ... fields ... } struct map_value_t { struct percpu_val_t __percpu_kptr *percpu_data_ptr; } In the above, 'map_value_t' is the map value type for a BPF_MAP_TYPE_CGRP_STORAGE map. User can access 'percpu_data_ptr' and then read/write percpu data. This covers BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE and more. So BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE map type is marked as deprecated. In additional, local percpu kptr supports the same map type as other kptrs including hash, lru_hash, array, sk/inode/task/cgrp local storage. Currently, percpu data structure does not support non-scalars or special fields (e.g., bpf_spin_lock, bpf_rb_root, etc.). They can be supported in the future if there exist use cases. Please for individual patches for details. [1] https://lore.kernel.org/all/20221026042835.672317-1-yhs@fb.com/ Changelog: v2 -> v3: - fix libbpf_str test failure. v1 -> v2: - does not support special fields in percpu data structure. - rename __percpu attr to __percpu_kptr attr. - rename BPF_KPTR_PERCPU_REF to BPF_KPTR_PERCPU. - better code to handle bpf_{this,per}_cpu_ptr() helpers. - add more negative tests. - fix a bpftool related test failure. ==================== Link: https://lore.kernel.org/r/20230827152729.1995219-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Enable IRQ after irq_work_raise() completes in unit_alloc()Hou Tao
When doing stress test for qp-trie, bpf_mem_alloc() returned NULL unexpectedly because all qp-trie operations were initiated from bpf syscalls and there was still available free memory. bpf_obj_new() has the same problem as shown by the following selftest. The failure is due to the preemption. irq_work_raise() will invoke irq_work_claim() first to mark the irq work as pending and then inovke __irq_work_queue_local() to raise an IPI. So when the current task which is invoking irq_work_raise() is preempted by other task, unit_alloc() may return NULL for preemption task as shown below: task A task B unit_alloc() // low_watermark = 32 // free_cnt = 31 after alloc irq_work_raise() // mark irq work as IRQ_WORK_PENDING irq_work_claim() // task B preempts task A unit_alloc() // free_cnt = 30 after alloc // irq work is already PENDING, // so just return irq_work_raise() // does unit_alloc() 30-times ...... unit_alloc() // free_cnt = 0 before alloc return NULL Fix it by enabling IRQ after irq_work_raise() completes. An alternative fix is using preempt_{disable|enable}_notrace() pair, but it may have extra overhead. Another feasible fix is to only disable preemption or IRQ before invoking irq_work_queue() and enable preemption or IRQ after the invocation completes, but it can't handle the case when c->low_watermark is 1. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230901111954.1804721-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecatedYonghong Song
Now 'BPF_MAP_TYPE_CGRP_STORAGE + local percpu ptr' can cover all BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE functionality and more. So mark BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE deprecated. Also make changes in selftests/bpf/test_bpftool_synctypes.py and selftest libbpf_str to fix otherwise test errors. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152837.2003563-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Add some negative testsYonghong Song
Add a few negative tests for common mistakes with using percpu kptr including: - store to percpu kptr. - type mistach in bpf_kptr_xchg arguments. - sleepable prog with untrusted arg for bpf_this_cpu_ptr(). - bpf_percpu_obj_new && bpf_obj_drop, and bpf_obj_new && bpf_percpu_obj_drop - struct with ptr for bpf_percpu_obj_new - struct with special field (e.g., bpf_spin_lock) for bpf_percpu_obj_new Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152832.2002421-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Add tests for cgrp_local_storage with local percpu kptrYonghong Song
Add a non-sleepable cgrp_local_storage test with percpu kptr. The test does allocation of percpu data, assigning values to percpu data and retrieval of percpu data. The de-allocation of percpu data is done when the map is freed. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152827.2001784-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Remove unnecessary direct read of local percpu kptrYonghong Song
For the second argument of bpf_kptr_xchg(), if the reg type contains MEM_ALLOC and MEM_PERCPU, which means a percpu allocation, after bpf_kptr_xchg(), the argument is marked as MEM_RCU and MEM_PERCPU if in rcu critical section. This way, re-reading from the map value is not needed. Remove it from the percpu_alloc_array.c selftest. Without previous kernel change, the test will fail like below: 0: R1=ctx(off=0,imm=0) R10=fp0 ; int BPF_PROG(test_array_map_10, int a) 0: (b4) w1 = 0 ; R1_w=0 ; int i, index = 0; 1: (63) *(u32 *)(r10 -4) = r1 ; R1_w=0 R10=fp0 fp-8=0000???? 2: (bf) r2 = r10 ; R2_w=fp0 R10=fp0 ; 3: (07) r2 += -4 ; R2_w=fp-4 ; e = bpf_map_lookup_elem(&array, &index); 4: (18) r1 = 0xffff88810e771800 ; R1_w=map_ptr(off=0,ks=4,vs=16,imm=0) 6: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0) 7: (bf) r6 = r0 ; R0_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0) R6_w=map_value_or_null(id=1,off=0,ks=4,vs=16,imm=0) ; if (!e) 8: (15) if r6 == 0x0 goto pc+81 ; R6_w=map_value(off=0,ks=4,vs=16,imm=0) ; bpf_rcu_read_lock(); 9: (85) call bpf_rcu_read_lock#87892 ; ; p = e->pc; 10: (bf) r7 = r6 ; R6=map_value(off=0,ks=4,vs=16,imm=0) R7_w=map_value(off=0,ks=4,vs=16,imm=0) 11: (07) r7 += 8 ; R7_w=map_value(off=8,ks=4,vs=16,imm=0) 12: (79) r6 = *(u64 *)(r6 +8) ; R6_w=percpu_rcu_ptr_or_null_val_t(id=2,off=0,imm=0) ; if (!p) { 13: (55) if r6 != 0x0 goto pc+13 ; R6_w=0 ; p = bpf_percpu_obj_new(struct val_t); 14: (18) r1 = 0x12 ; R1_w=18 16: (b7) r2 = 0 ; R2_w=0 17: (85) call bpf_percpu_obj_new_impl#87883 ; R0_w=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) refs=4 18: (bf) r6 = r0 ; R0=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) R6=percpu_ptr_or_null_val_t(id=4,ref_obj_id=4,off=0,imm=0) refs=4 ; if (!p) 19: (15) if r6 == 0x0 goto pc+69 ; R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) refs=4 ; p1 = bpf_kptr_xchg(&e->pc, p); 20: (bf) r1 = r7 ; R1_w=map_value(off=8,ks=4,vs=16,imm=0) R7=map_value(off=8,ks=4,vs=16,imm=0) refs=4 21: (bf) r2 = r6 ; R2_w=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) refs=4 22: (85) call bpf_kptr_xchg#194 ; R0_w=percpu_ptr_or_null_val_t(id=6,ref_obj_id=6,off=0,imm=0) refs=6 ; if (p1) { 23: (15) if r0 == 0x0 goto pc+3 ; R0_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) refs=6 ; bpf_percpu_obj_drop(p1); 24: (bf) r1 = r0 ; R0_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) R1_w=percpu_ptr_val_t(ref_obj_id=6,off=0,imm=0) refs=6 25: (b7) r2 = 0 ; R2_w=0 refs=6 26: (85) call bpf_percpu_obj_drop_impl#87882 ; ; v = bpf_this_cpu_ptr(p); 27: (bf) r1 = r6 ; R1_w=scalar(id=7) R6=scalar(id=7) 28: (85) call bpf_this_cpu_ptr#154 R1 type=scalar expected=percpu_ptr_, percpu_rcu_ptr_, percpu_trusted_ptr_ The R1 which gets its value from R6 is a scalar. But before insn 22, R6 is R6=percpu_ptr_val_t(ref_obj_id=4,off=0,imm=0) Its type is changed to a scalar at insn 22 without previous patch. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152821.2001129-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Mark OBJ_RELEASE argument as MEM_RCU when possibleYonghong Song
In previous selftests/bpf patch, we have p = bpf_percpu_obj_new(struct val_t); if (!p) goto out; p1 = bpf_kptr_xchg(&e->pc, p); if (p1) { /* race condition */ bpf_percpu_obj_drop(p1); } p = e->pc; if (!p) goto out; After bpf_kptr_xchg(), we need to re-read e->pc into 'p'. This is due to that the second argument of bpf_kptr_xchg() is marked OBJ_RELEASE and it will be marked as invalid after the call. So after bpf_kptr_xchg(), 'p' is an unknown scalar, and the bpf program needs to reread from the map value. This patch checks if the 'p' has type MEM_ALLOC and MEM_PERCPU, and if 'p' is RCU protected. If this is the case, 'p' can be marked as MEM_RCU. MEM_ALLOC needs to be removed since 'p' is not an owning reference any more. Such a change makes re-read from the map value unnecessary. Note that re-reading 'e->pc' after bpf_kptr_xchg() might get a different value from 'p' if immediately before 'p = e->pc', another cpu may do another bpf_kptr_xchg() and swap in another value into 'e->pc'. If this is the case, then 'p = e->pc' may get either 'p' or another value, and race condition already exists. So removing direct re-reading seems fine too. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152816.2000760-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Add tests for array map with local percpu kptrYonghong Song
Add non-sleepable and sleepable tests with percpu kptr. For non-sleepable test, four programs are executed in the order of: 1. allocate percpu data. 2. assign values to percpu data. 3. retrieve percpu data. 4. de-allocate percpu data. The sleepable prog tried to exercise all above 4 steps in a single prog. Also for sleepable prog, rcu_read_lock is needed to protect direct percpu ptr access (from map value) and following bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152811.2000125-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08selftests/bpf: Add bpf_percpu_obj_{new,drop}() macro in bpf_experimental.hYonghong Song
The new macro bpf_percpu_obj_{new/drop}() is very similar to bpf_obj_{new,drop}() as they both take a type as the argument. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152805.1999417-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08libbpf: Add __percpu_kptr macro definitionYonghong Song
Add __percpu_kptr macro definition in bpf_helpers.h. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152800.1998492-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08libbpf: Add basic BTF sanity validationAndrii Nakryiko
Implement a simple and straightforward BTF sanity check when parsing BTF data. Right now it's very basic and just validates that all the string offsets and type IDs are within valid range. For FUNC we also check that it points to FUNC_PROTO kinds. Even with such simple checks it fixes a bunch of crashes found by OSS fuzzer ([0]-[5]) and will allow fuzzer to make further progress. Some other invariants will be checked in follow up patches (like ensuring there is no infinite type loops), but this seems like a good start already. Adding FUNC -> FUNC_PROTO check revealed that one of selftests has a problem with FUNC pointing to VAR instead, so fix it up in the same commit. [0] https://github.com/libbpf/libbpf/issues/482 [1] https://github.com/libbpf/libbpf/issues/483 [2] https://github.com/libbpf/libbpf/issues/485 [3] https://github.com/libbpf/libbpf/issues/613 [4] https://github.com/libbpf/libbpf/issues/618 [5] https://github.com/libbpf/libbpf/issues/619 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Alan Maguire <alan.maguire@oracle.com> Reviewed-by: Song Liu <song@kernel.org> Closes: https://github.com/libbpf/libbpf/issues/617 Link: https://lore.kernel.org/bpf/20230825202152.1813394-1-andrii@kernel.org
2023-09-08selftests/bpf: Update error message in negative linked_list testYonghong Song
Some error messages are changed due to the addition of percpu kptr support. Fix linked_list test with changed error messages. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152754.1997769-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add bpf_this_cpu_ptr/bpf_per_cpu_ptr support for allocated percpu objYonghong Song
The bpf helpers bpf_this_cpu_ptr() and bpf_per_cpu_ptr() are re-purposed for allocated percpu objects. For an allocated percpu obj, the reg type is 'PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU'. The return type for these two re-purposed helpera is 'PTR_TO_MEM | MEM_RCU | MEM_ALLOC'. The MEM_ALLOC allows that the per-cpu data can be read and written. Since the memory allocator bpf_mem_alloc() returns a ptr to a percpu ptr for percpu data, the first argument of bpf_this_cpu_ptr() and bpf_per_cpu_ptr() is patched with a dereference before passing to the helper func. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152749.1997202-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add alloc/xchg/direct_access support for local percpu kptrYonghong Song
Add two new kfunc's, bpf_percpu_obj_new_impl() and bpf_percpu_obj_drop_impl(), to allocate a percpu obj. Two functions are very similar to bpf_obj_new_impl() and bpf_obj_drop_impl(). The major difference is related to percpu handling. bpf_rcu_read_lock() struct val_t __percpu_kptr *v = map_val->percpu_data; ... bpf_rcu_read_unlock() For a percpu data map_val like above 'v', the reg->type is set as PTR_TO_BTF_ID | MEM_PERCPU | MEM_RCU if inside rcu critical section. MEM_RCU marking here is similar to NON_OWN_REF as 'v' is not a owning reference. But NON_OWN_REF is trusted and typically inside the spinlock while MEM_RCU is under rcu read lock. RCU is preferred here since percpu data structures mean potential concurrent access into its contents. Also, bpf_percpu_obj_new_impl() is restricted such that no pointers or special fields are allowed. Therefore, the bpf_list_head and bpf_rb_root will not be supported in this patch set to avoid potential memory leak issue due to racing between bpf_obj_free_fields() and another bpf_kptr_xchg() moving an allocated object to bpf_list_head and bpf_rb_root. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152744.1996739-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add BPF_KPTR_PERCPU as a field typeYonghong Song
BPF_KPTR_PERCPU represents a percpu field type like below struct val_t { ... fields ... }; struct t { ... struct val_t __percpu_kptr *percpu_data_ptr; ... }; where #define __percpu_kptr __attribute__((btf_type_tag("percpu_kptr"))) While BPF_KPTR_REF points to a trusted kernel object or a trusted local object, BPF_KPTR_PERCPU points to a trusted local percpu object. This patch added basic support for BPF_KPTR_PERCPU related to percpu_kptr field parsing, recording and free operations. BPF_KPTR_PERCPU also supports the same map types as BPF_KPTR_REF does. Note that unlike a local kptr, it is possible that a BPF_KTPR_PERCPU struct may not contain any special fields like other kptr, bpf_spin_lock, bpf_list_head, etc. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152739.1996391-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-08bpf: Add support for non-fix-size percpu mem allocationYonghong Song
This is needed for later percpu mem allocation when the allocation is done by bpf program. For such cases, a global bpf_global_percpu_ma is added where a flexible allocation size is needed. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230827152734.1995725-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>