summaryrefslogtreecommitdiff
path: root/tools/include/uapi/linux/bpf.h
AgeCommit message (Collapse)Author
2023-07-25bpf, net: Support SO_REUSEPORT sockets with bpf_sk_assignLorenz Bauer
Currently the bpf_sk_assign helper in tc BPF context refuses SO_REUSEPORT sockets. This means we can't use the helper to steer traffic to Envoy, which configures SO_REUSEPORT on its sockets. In turn, we're blocked from removing TPROXY from our setup. The reason that bpf_sk_assign refuses such sockets is that the bpf_sk_lookup helpers don't execute SK_REUSEPORT programs. Instead, one of the reuseport sockets is selected by hash. This could cause dispatch to the "wrong" socket: sk = bpf_sk_lookup_tcp(...) // select SO_REUSEPORT by hash bpf_sk_assign(skb, sk) // SK_REUSEPORT wasn't executed Fixing this isn't as simple as invoking SK_REUSEPORT from the lookup helpers unfortunately. In the tc context, L2 headers are at the start of the skb, while SK_REUSEPORT expects L3 headers instead. Instead, we execute the SK_REUSEPORT program when the assigned socket is pulled out of the skb, further up the stack. This creates some trickiness with regards to refcounting as bpf_sk_assign will put both refcounted and RCU freed sockets in skb->sk. reuseport sockets are RCU freed. We can infer that the sk_assigned socket is RCU freed if the reuseport lookup succeeds, but convincing yourself of this fact isn't straight forward. Therefore we defensively check refcounting on the sk_assign sock even though it's probably not required in practice. Fixes: 8e368dc72e86 ("bpf: Fix use of sk->sk_reuseport from sk_assign") Fixes: cf7fbe660f2d ("bpf: Add socket assign support") Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Joe Stringer <joe@cilium.io> Link: https://lore.kernel.org/bpf/CACAyw98+qycmpQzKupquhkxbvWK4OFyDuuLMBNROnfWMZxUWeA@mail.gmail.com/ Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/r/20230720-so-reuseport-v6-7-7021b683cdae@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-19bpf: sync tools/ uapi header withAlan Maguire
Seeing the following: Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h' ...so sync tools version missing some list_node/rb_tree fields. Fixes: c3c510ce431c ("bpf: Add 'owner' field to bpf_{list,rb}_node") Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Link: https://lore.kernel.org/r/20230719162257.20818-1-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19bpf: Add fd-based tcx multi-prog infra with link supportDaniel Borkmann
This work refactors and adds a lightweight extension ("tcx") to the tc BPF ingress and egress data path side for allowing BPF program management based on fds via bpf() syscall through the newly added generic multi-prog API. The main goal behind this work which we also presented at LPC [0] last year and a recent update at LSF/MM/BPF this year [3] is to support long-awaited BPF link functionality for tc BPF programs, which allows for a model of safe ownership and program detachment. Given the rise in tc BPF users in cloud native environments, this becomes necessary to avoid hard to debug incidents either through stale leftover programs or 3rd party applications accidentally stepping on each others toes. As a recap, a BPF link represents the attachment of a BPF program to a BPF hook point. The BPF link holds a single reference to keep BPF program alive. Moreover, hook points do not reference a BPF link, only the application's fd or pinning does. A BPF link holds meta-data specific to attachment and implements operations for link creation, (atomic) BPF program update, detachment and introspection. The motivation for BPF links for tc BPF programs is multi-fold, for example: - From Meta: "It's especially important for applications that are deployed fleet-wide and that don't "control" hosts they are deployed to. If such application crashes and no one notices and does anything about that, BPF program will keep running draining resources or even just, say, dropping packets. We at FB had outages due to such permanent BPF attachment semantics. With fd-based BPF link we are getting a framework, which allows safe, auto-detachable behavior by default, unless application explicitly opts in by pinning the BPF link." [1] - From Cilium-side the tc BPF programs we attach to host-facing veth devices and phys devices build the core datapath for Kubernetes Pods, and they implement forwarding, load-balancing, policy, EDT-management, etc, within BPF. Currently there is no concept of 'safe' ownership, e.g. we've recently experienced hard-to-debug issues in a user's staging environment where another Kubernetes application using tc BPF attached to the same prio/handle of cls_bpf, accidentally wiping all Cilium-based BPF programs from underneath it. The goal is to establish a clear/safe ownership model via links which cannot accidentally be overridden. [0,2] BPF links for tc can co-exist with non-link attachments, and the semantics are in line also with XDP links: BPF links cannot replace other BPF links, BPF links cannot replace non-BPF links, non-BPF links cannot replace BPF links and lastly only non-BPF links can replace non-BPF links. In case of Cilium, this would solve mentioned issue of safe ownership model as 3rd party applications would not be able to accidentally wipe Cilium programs, even if they are not BPF link aware. Earlier attempts [4] have tried to integrate BPF links into core tc machinery to solve cls_bpf, which has been intrusive to the generic tc kernel API with extensions only specific to cls_bpf and suboptimal/complex since cls_bpf could be wiped from the qdisc also. Locking a tc BPF program in place this way, is getting into layering hacks given the two object models are vastly different. We instead implemented the tcx (tc 'express') layer which is an fd-based tc BPF attach API, so that the BPF link implementation blends in naturally similar to other link types which are fd-based and without the need for changing core tc internal APIs. BPF programs for tc can then be successively migrated from classic cls_bpf to the new tc BPF link without needing to change the program's source code, just the BPF loader mechanics for attaching is sufficient. For the current tc framework, there is no change in behavior with this change and neither does this change touch on tc core kernel APIs. The gist of this patch is that the ingress and egress hook have a lightweight, qdisc-less extension for BPF to attach its tc BPF programs, in other words, a minimal entry point for tc BPF. The name tcx has been suggested from discussion of earlier revisions of this work as a good fit, and to more easily differ between the classic cls_bpf attachment and the fd-based one. For the ingress and egress tcx points, the device holds a cache-friendly array with program pointers which is separated from control plane (slow-path) data. Earlier versions of this work used priority to determine ordering and expression of dependencies similar as with classic tc, but it was challenged that for something more future-proof a better user experience is required. Hence this resulted in the design and development of the generic attach/detach/query API for multi-progs. See prior patch with its discussion on the API design. tcx is the first user and later we plan to integrate also others, for example, one candidate is multi-prog support for XDP which would benefit and have the same 'look and feel' from API perspective. The goal with tcx is to have maximum compatibility to existing tc BPF programs, so they don't need to be rewritten specifically. Compatibility to call into classic tcf_classify() is also provided in order to allow successive migration or both to cleanly co-exist where needed given its all one logical tc layer and the tcx plus classic tc cls/act build one logical overall processing pipeline. tcx supports the simplified return codes TCX_NEXT which is non-terminating (go to next program) and terminating ones with TCX_PASS, TCX_DROP, TCX_REDIRECT. The fd-based API is behind a static key, so that when unused the code is also not entered. The struct tcx_entry's program array is currently static, but could be made dynamic if necessary at a point in future. The a/b pair swap design has been chosen so that for detachment there are no allocations which otherwise could fail. The work has been tested with tc-testing selftest suite which all passes, as well as the tc BPF tests from the BPF CI, and also with Cilium's L4LB. Thanks also to Nikolay Aleksandrov and Martin Lau for in-depth early reviews of this work. [0] https://lpc.events/event/16/contributions/1353/ [1] https://lore.kernel.org/bpf/CAEf4BzbokCJN33Nw_kg82sO=xppXnKWEncGTWCTB9vGCmLB6pw@mail.gmail.com [2] https://colocatedeventseu2023.sched.com/event/1Jo6O/tales-from-an-ebpf-programs-murder-mystery-hemanth-malla-guillaume-fournier-datadog [3] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf [4] https://lore.kernel.org/bpf/20210604063116.234316-1-memxor@gmail.com Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230719140858.13224-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19bpf: Add generic attach/detach/query API for multi-progsDaniel Borkmann
This adds a generic layer called bpf_mprog which can be reused by different attachment layers to enable multi-program attachment and dependency resolution. In-kernel users of the bpf_mprog don't need to care about the dependency resolution internals, they can just consume it with few API calls. The initial idea of having a generic API sparked out of discussion [0] from an earlier revision of this work where tc's priority was reused and exposed via BPF uapi as a way to coordinate dependencies among tc BPF programs, similar as-is for classic tc BPF. The feedback was that priority provides a bad user experience and is hard to use [1], e.g.: I cannot help but feel that priority logic copy-paste from old tc, netfilter and friends is done because "that's how things were done in the past". [...] Priority gets exposed everywhere in uapi all the way to bpftool when it's right there for users to understand. And that's the main problem with it. The user don't want to and don't need to be aware of it, but uapi forces them to pick the priority. [...] Your cover letter [0] example proves that in real life different service pick the same priority. They simply don't know any better. Priority is an unnecessary magic that apps _have_ to pick, so they just copy-paste and everyone ends up using the same. The course of the discussion showed more and more the need for a generic, reusable API where the "same look and feel" can be applied for various other program types beyond just tc BPF, for example XDP today does not have multi- program support in kernel, but also there was interest around this API for improving management of cgroup program types. Such common multi-program management concept is useful for BPF management daemons or user space BPF applications coordinating internally about their attachments. Both from Cilium and Meta side [2], we've collected the following requirements for a generic attach/detach/query API for multi-progs which has been implemented as part of this work: - Support prog-based attach/detach and link API - Dependency directives (can also be combined): - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none} - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user space application does not need CAP_SYS_ADMIN to retrieve foreign fds via bpf_*_get_fd_by_id() - BPF_F_LINK flag as {prog,link} toggle - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and BPF_F_AFTER will just append for attaching - Enforced only at attach time - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their own infra for replacing their internal prog - If no flags are set, then it's default append behavior for attaching - Internal revision counter and optionally being able to pass expected_revision - User space application can query current state with revision, and pass it along for attachment to assert current state before doing updates - Query also gets extension for link_ids array and link_attach_flags: - prog_ids are always filled with program IDs - link_ids are filled with link IDs when link was used, otherwise 0 - {prog,link}_attach_flags for holding {prog,link}-specific flags - Must be easy to integrate/reuse for in-kernel users The uapi-side changes needed for supporting bpf_mprog are rather minimal, consisting of the additions of the attachment flags, revision counter, and expanding existing union with relative_{fd,id} member. The bpf_mprog framework consists of an bpf_mprog_entry object which holds an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path structure) is part of bpf_mprog_bundle. Both have been separated, so that fast-path gets efficient packing of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen instead of linked list or other structures to remove unnecessary indirections for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry is populated and then just swapped which avoids additional allocations that could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could be converted to dynamic allocation if necessary at a point in future. Locking is deferred to the in-kernel user of bpf_mprog, for example, in case of tcx which uses this API in the next patch, it piggybacks on rtnl. An extensive test suite for checking all aspects of this API for prog-based attach/detach and link API comes as BPF selftests in this series. Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog management. [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11bpf: Support ->fill_link_info for perf_eventYafang Shao
By introducing support for ->fill_link_info to the perf_event link, users gain the ability to inspect it using `bpftool link show`. While the current approach involves accessing this information via `bpftool perf show`, consolidating link information for all link types in one place offers greater convenience. Additionally, this patch extends support to the generic perf event, which is not currently accommodated by `bpftool perf show`. While only the perf type and config are exposed to userspace, other attributes such as sample_period and sample_freq are ignored. It's important to note that if kptr_restrict is not permitted, the probed address will not be exposed, maintaining security measures. A new enum bpf_perf_event_type is introduced to help the user understand which struct is relevant. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230709025630.3735-9-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11bpf: Support ->fill_link_info for kprobe_multiYafang Shao
With the addition of support for fill_link_info to the kprobe_multi link, users will gain the ability to inspect it conveniently using the `bpftool link show`. This enhancement provides valuable information to the user, including the count of probed functions and their respective addresses. It's important to note that if the kptr_restrict setting is not permitted, the probed address will not be exposed, ensuring security. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230709025630.3735-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-24Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-06-23 We've added 49 non-merge commits during the last 24 day(s) which contain a total of 70 files changed, 1935 insertions(+), 442 deletions(-). The main changes are: 1) Extend bpf_fib_lookup helper to allow passing the route table ID, from Louis DeLosSantos. 2) Fix regsafe() in verifier to call check_ids() for scalar registers, from Eduard Zingerman. 3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and() and a rework of bpf_cpumask_any*() kfuncs. Additionally, add selftests, from David Vernet. 4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings, from Gilad Sever. 5) Change bpf_link_put() to use workqueue unconditionally to fix it under PREEMPT_RT, from Sebastian Andrzej Siewior. 6) Follow-ups to address issues in the bpf_refcount shared ownership implementation, from Dave Marchevsky. 7) A few general refactorings to BPF map and program creation permissions checks which were part of the BPF token series, from Andrii Nakryiko. 8) Various fixes for benchmark framework and add a new benchmark for BPF memory allocator to BPF selftests, from Hou Tao. 9) Documentation improvements around iterators and trusted pointers, from Anton Protopopov. 10) Small cleanup in verifier to improve allocated object check, from Daniel T. Lee. 11) Improve performance of bpf_xdp_pointer() by avoiding access to shared_info when XDP packet does not have frags, from Jesper Dangaard Brouer. 12) Silence a harmless syzbot-reported warning in btf_type_id_size(), from Yonghong Song. 13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper, from Jarkko Sakkinen. 14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits) bpf, docs: Document existing macros instead of deprecated bpf, docs: BPF Iterator Document selftests/bpf: Fix compilation failure for prog vrf_socket_lookup selftests/bpf: Add vrf_socket_lookup tests bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint bpf: Factor out socket lookup functions for the TC hookpoint. selftests/bpf: Set the default value of consumer_cnt as 0 selftests/bpf: Ensure that next_cpu() returns a valid CPU number selftests/bpf: Output the correct error code for pthread APIs selftests/bpf: Use producer_cnt to allocate local counter array xsk: Remove unused inline function xsk_buff_discard() bpf: Keep BPF_PROG_LOAD permission checks clear of validations bpf: Centralize permissions checks for all BPF map types bpf: Inline map creation logic in map_create() function bpf: Move unprivileged checks into map_create() and bpf_prog_load() bpf: Remove in_atomic() from bpf_link_put(). selftests/bpf: Verify that check_ids() is used for scalars in regsafe() bpf: Verify scalar ids mapping in regsafe() using check_ids() selftests/bpf: Check if mark_chain_precision() follows scalar ids ... ==================== Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/sched/sch_taprio.c d636fc5dd692 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping") dced11ef84fb ("net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()") net/ipv4/sysctl_net_ipv4.c e209fee4118f ("net/ipv4: ping_group_range: allow GID from 2147483648 to 4294967294") ccce324dabfe ("tcp: make the first N SYN RTO backoffs linear") https://lore.kernel.org/all/20230605100816.08d41a7b@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-05bpf: netfilter: Add BPF_NETFILTER bpf_attach_typeFlorian Westphal
Andrii Nakryiko writes: And we currently don't have an attach type for NETLINK BPF link. Thankfully it's not too late to add it. I see that link_create() in kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't have done that. Instead we need to add BPF_NETLINK attach type to enum bpf_attach_type. And wire all that properly throughout the kernel and libbpf itself. This adds BPF_NETFILTER and uses it. This breaks uabi but this wasn't in any non-rc release yet, so it should be fine. v2: check link_attack prog type in link_create too Fixes: 84601d6ee68a ("bpf: add bpf_link support for BPF_NETFILTER programs") Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20230605131445.32016-1-fw@strlen.de
2023-06-01bpf: Add table ID to bpf_fib_lookup BPF helperLouis DeLosSantos
Add ability to specify routing table ID to the `bpf_fib_lookup` BPF helper. A new field `tbid` is added to `struct bpf_fib_lookup` used as parameters to the `bpf_fib_lookup` BPF helper. When the helper is called with the `BPF_FIB_LOOKUP_DIRECT` and `BPF_FIB_LOOKUP_TBID` flags the `tbid` field in `struct bpf_fib_lookup` will be used as the table ID for the fib lookup. If the `tbid` does not exist the fib lookup will fail with `BPF_FIB_LKUP_RET_NOT_FWDED`. The `tbid` field becomes a union over the vlan related output fields in `struct bpf_fib_lookup` and will be zeroed immediately after usage. This functionality is useful in containerized environments. For instance, if a CNI wants to dictate the next-hop for traffic leaving a container it can create a container-specific routing table and perform a fib lookup against this table in a "host-net-namespace-side" TC program. This functionality also allows `ip rule` like functionality at the TC layer, allowing an eBPF program to pick a routing table based on some aspect of the sk_buff. As a concrete use case, this feature will be used in Cilium's SRv6 L3VPN datapath. When egress traffic leaves a Pod an eBPF program attached by Cilium will determine which VRF the egress traffic should target, and then perform a FIB lookup in a specific table representing this VRF's FIB. Signed-off-by: Louis DeLosSantos <louis.delos.devel@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230505-bpf-add-tbid-fib-lookup-v2-1-0a31c22c748c@gmail.com
2023-05-23bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commandsAndrii Nakryiko
Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall forces users to specify pinning location as a string-based absolute or relative (to current working directory) path. This has various implications related to security (e.g., symlink-based attacks), forces BPF FS to be exposed in the file system, which can cause races with other applications. One of the feedbacks we got from folks working with containers heavily was that inability to use purely FD-based location specification was an unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET commands. This patch closes this oversight, adding path_fd field to BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by *at() syscalls for dirfd + pathname combinations. This now allows interesting possibilities like working with detached BPF FS mount (e.g., to perform multiple pinnings without running a risk of someone interfering with them), and generally making pinning/getting more secure and not prone to any races and/or security attacks. This is demonstrated by a selftest added in subsequent patch that takes advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate creating detached BPF FS mount, pinning, and then getting BPF map out of it, all while never exposing this private instance of BPF FS to outside worlds. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org
2023-04-21tools: bpftool: print netfilter link infoFlorian Westphal
Dump protocol family, hook and priority value: $ bpftool link 2: netfilter prog 14 ip input prio -128 pids install(3264) 5: netfilter prog 14 ip6 forward prio 21 pids a.out(3387) 9: netfilter prog 14 ip prerouting prio 123 pids a.out(5700) 10: netfilter prog 14 ip input prio 21 pids test2(5701) v2: Quentin Monnet suggested to also add 'bpftool net' support: $ bpftool net xdp: tc: flow_dissector: netfilter: ip prerouting prio 21 prog_id 14 ip input prio -128 prog_id 14 ip input prio 21 prog_id 14 ip forward prio 21 prog_id 14 ip output prio 21 prog_id 14 ip postrouting prio 21 prog_id 14 'bpftool net' only dumps netfilter link type, links are sorted by protocol family, hook and priority. v5: fix bpf ci failure: libbpf needs small update to prog_type_name[] and probe_prog_load helper. v4: don't fail with -EOPNOTSUPP in libbpf probe_prog_load, update prog_type_name[] with "netfilter" entry (bpf ci) v3: fix bpf.h copy, 'reserved' member was removed (Alexei) use p_err, not fprintf (Quentin) Suggested-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/eeeaac99-9053-90c2-aa33-cc1ecb1ae9ca@isovalent.com/ Reviewed-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230421170300.24115-6-fw@strlen.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-15bpf: Introduce opaque bpf_refcount struct and add btf_record plumbingDave Marchevsky
A 'struct bpf_refcount' is added to the set of opaque uapi/bpf.h types meant for use in BPF programs. Similarly to other opaque types like bpf_spin_lock and bpf_rbtree_node, the verifier needs to know where in user-defined struct types a bpf_refcount can be located, so necessary btf_record plumbing is added to enable this. bpf_refcount is sized to hold a refcount_t. Similarly to bpf_spin_lock, the offset of a bpf_refcount is cached in btf_record as refcount_off in addition to being in the field array. Caching refcount_off makes sense for this field because further patches in the series will modify functions that take local kptrs (e.g. bpf_obj_drop) to change their behavior if the type they're operating on is refcounted. So enabling fast "is this type refcounted?" checks is desirable. No such verifier behavior changes are introduced in this patch, just logic to recognize 'struct bpf_refcount' in btf_record. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230415201811.343116-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-04-11bpf: Add log_true_size output field to return necessary log buffer sizeAndrii Nakryiko
Add output-only log_true_size and btf_log_true_size field to BPF_PROG_LOAD and BPF_BTF_LOAD commands, respectively. It will return the size of log buffer necessary to fit in all the log contents at specified log_level. This is very useful for BPF loader libraries like libbpf to be able to size log buffer correctly, but could be used by users directly, if necessary, as well. This patch plumbs all this through the code, taking into account actual bpf_attr size provided by user to determine if these new fields are expected by users. And if they are, set them from kernel on return. We refactory btf_parse() function to accommodate this, moving attr and uattr handling inside it. The rest is very straightforward code, which is split from the logging accounting changes in the previous patch to make it simpler to review logic vs UAPI changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Lorenz Bauer <lmb@isovalent.com> Link: https://lore.kernel.org/bpf/20230406234205.323208-13-andrii@kernel.org
2023-03-22bpf: Update the struct_ops of a bpf_link.Kui-Feng Lee
By improving the BPF_LINK_UPDATE command of bpf(), it should allow you to conveniently switch between different struct_ops on a single bpf_link. This would enable smoother transitions from one struct_ops to another. The struct_ops maps passing along with BPF_LINK_UPDATE should have the BPF_F_LINK flag. Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230323032405.3735486-6-kuifeng@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-22bpf: Create links for BPF struct_ops maps.Kui-Feng Lee
Make bpf_link support struct_ops. Previously, struct_ops were always used alone without any associated links. Upon updating its value, a struct_ops would be activated automatically. Yet other BPF program types required to make a bpf_link with their instances before they could become active. Now, however, you can create an inactive struct_ops, and create a link to activate it later. With bpf_links, struct_ops has a behavior similar to other BPF program types. You can pin/unpin them from their links and the struct_ops will be deactivated when its link is removed while previously need someone to delete the value for it to be deactivated. bpf_links are responsible for registering their associated struct_ops. You can only use a struct_ops that has the BPF_F_LINK flag set to create a bpf_link, while a structs without this flag behaves in the same manner as before and is registered upon updating its value. The BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose. Not only is it used to craft the links for BPF struct_ops programs, but also to create links for BPF struct_ops them-self. Since the links of BPF struct_ops programs are only used to create trampolines internally, they are never seen in other contexts. Thus, they can be reused for struct_ops themself. To maintain a reference to the map supporting this link, we add bpf_struct_ops_link as an additional type. The pointer of the map is RCU and won't be necessary until later in the patchset. Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Link: https://lore.kernel.org/r/20230323032405.3735486-4-kuifeng@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-03-13bpf: use canonical ftrace pathRoss Zwisler
The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing Many comments and samples in the bpf code still refer to this older debugfs path, so let's update them to avoid confusion. There are a few spots where the bpf code explicitly checks both tracefs and debugfs (tools/bpf/bpftool/tracelog.c and tools/lib/api/fs/fs.c) and I've left those alone so that the tools can continue to work with both paths. Signed-off-by: Ross Zwisler <zwisler@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20230313205628.1058720-2-zwisler@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-09bpf: Fix a typo for BPF_F_ANY_ALIGNMENT in bpf.hMichael Weiß
Fix s/BPF_PROF_LOAD/BPF_PROG_LOAD/ typo in the documentation comment for BPF_F_ANY_ALIGNMENT in bpf.h. Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230309133823.944097-1-michael.weiss@aisec.fraunhofer.de
2023-03-08bpf: implement numbers iteratorAndrii Nakryiko
Implement the first open-coded iterator type over a range of integers. It's public API consists of: - bpf_iter_num_new() constructor, which accepts [start, end) range (that is, start is inclusive, end is exclusive). - bpf_iter_num_next() which will keep returning read-only pointer to int until the range is exhausted, at which point NULL will be returned. If bpf_iter_num_next() is kept calling after this, NULL will be persistently returned. - bpf_iter_num_destroy() destructor, which needs to be called at some point to clean up iterator state. BPF verifier enforces that iterator destructor is called at some point before BPF program exits. Note that `start = end = X` is a valid combination to setup an empty iterator. bpf_iter_num_new() will return 0 (success) for any such combination. If bpf_iter_num_new() detects invalid combination of input arguments, it returns error, resets iterator state to, effectively, empty iterator, so any subsequent call to bpf_iter_num_next() will keep returning NULL. BPF verifier has no knowledge that returned integers are in the [start, end) value range, as both `start` and `end` are not statically known and enforced: they are runtime values. While the implementation is pretty trivial, some care needs to be taken to avoid overflows and underflows. Subsequent selftests will validate correctness of [start, end) semantics, especially around extremes (INT_MIN and INT_MAX). Similarly to bpf_loop(), we enforce that no more than BPF_MAX_LOOPS can be specified. bpf_iter_num_{new,next,destroy}() is a logical evolution from bounded BPF loops and bpf_loop() helper and is the basis for implementing ergonomic BPF loops with no statically known or verified bounds. Subsequent patches implement bpf_for() macro, demonstrating how this can be wrapped into something that works and feels like a normal for() loop in C language. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230308184121.1165081-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-02bpf: Add support for absolute value BPF timersTero Kristo
Add a new flag BPF_F_TIMER_ABS that can be passed to bpf_timer_start() to start an absolute value timer instead of the default relative value. This makes the timer expire at an exact point in time, instead of a time with latencies induced by both the BPF and timer subsystems. Suggested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com> Link: https://lore.kernel.org/r/20230302114614.2985072-2-tero.kristo@linux.intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01bpf: Add bpf_dynptr_slice and bpf_dynptr_slice_rdwrJoanne Koong
Two new kfuncs are added, bpf_dynptr_slice and bpf_dynptr_slice_rdwr. The user must pass in a buffer to store the contents of the data slice if a direct pointer to the data cannot be obtained. For skb and xdp type dynptrs, these two APIs are the only way to obtain a data slice. However, for other types of dynptrs, there is no difference between bpf_dynptr_slice(_rdwr) and bpf_dynptr_data. For skb type dynptrs, the data is copied into the user provided buffer if any of the data is not in the linear portion of the skb. For xdp type dynptrs, the data is copied into the user provided buffer if the data is between xdp frags. If the skb is cloned and a call to bpf_dynptr_data_rdwr is made, then the skb will be uncloned (see bpf_unclone_prologue()). Please note that any bpf_dynptr_write() automatically invalidates any prior data slices of the skb dynptr. This is because the skb may be cloned or may need to pull its paged buffer into the head. As such, any bpf_dynptr_write() will automatically have its prior data slices invalidated, even if the write is to data in the skb head of an uncloned skb. Please note as well that any other helper calls that change the underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data slices of the skb dynptr as well, for the same reasons. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20230301154953.641654-10-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01bpf: Add xdp dynptrsJoanne Koong
Add xdp dynptrs, which are dynptrs whose underlying pointer points to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main benefits. One is that they allow operations on sizes that are not statically known at compile-time (eg variable-sized accesses). Another is that parsing the packet data through dynptrs (instead of through direct access of xdp->data and xdp->data_end) can be more ergonomic and less brittle (eg does not need manual if checking for being within bounds of data_end). For reads and writes on the dynptr, this includes reading/writing from/to and across fragments. Data slices through the bpf_dynptr_data API are not supported; instead bpf_dynptr_slice() and bpf_dynptr_slice_rdwr() should be used. For examples of how xdp dynptrs can be used, please see the attached selftests. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20230301154953.641654-9-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-03-01bpf: Add skb dynptrsJoanne Koong
Add skb dynptrs, which are dynptrs whose underlying pointer points to a skb. The dynptr acts on skb data. skb dynptrs have two main benefits. One is that they allow operations on sizes that are not statically known at compile-time (eg variable-sized accesses). Another is that parsing the packet data through dynptrs (instead of through direct access of skb->data and skb->data_end) can be more ergonomic and less brittle (eg does not need manual if checking for being within bounds of data_end). For bpf prog types that don't support writes on skb data, the dynptr is read-only (bpf_dynptr_write() will return an error) For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write() interfaces, reading and writing from/to data in the head as well as from/to non-linear paged buffers is supported. Data slices through the bpf_dynptr_data API are not supported; instead bpf_dynptr_slice() and bpf_dynptr_slice_rdwr() (added in subsequent commit) should be used. For examples of how skb dynptrs can be used, please see the attached selftests. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20230301154953.641654-8-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-17bpf: Add BPF_FIB_LOOKUP_SKIP_NEIGH for bpf_fib_lookupMartin KaFai Lau
The bpf_fib_lookup() also looks up the neigh table. This was done before bpf_redirect_neigh() was added. In the use case that does not manage the neigh table and requires bpf_fib_lookup() to lookup a fib to decide if it needs to redirect or not, the bpf prog can depend only on using bpf_redirect_neigh() to lookup the neigh. It also keeps the neigh entries fresh and connected. This patch adds a bpf_fib_lookup flag, SKIP_NEIGH, to avoid the double neigh lookup when the bpf prog always call bpf_redirect_neigh() to do the neigh lookup. The params->smac output is skipped together when SKIP_NEIGH is set because bpf_redirect_neigh() will figure out the smac also. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230217205515.3583372-1-martin.lau@linux.dev
2023-02-13bpf: Add basic bpf_rb_{root,node} supportDave Marchevsky
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new types, and adds bpf_rb_root_free function for freeing bpf_rb_root in map_values. structs bpf_rb_root and bpf_rb_node are opaque types meant to obscure structs rb_root_cached rb_node, respectively. btf_struct_access will prevent BPF programs from touching these special fields automatically now that they're recognized. btf_check_and_fixup_fields now groups list_head and rb_root together as "graph root" fields and {list,rb}_node as "graph node", and does same ownership cycle checking as before. Note that this function does _not_ prevent ownership type mixups (e.g. rb_root owning list_node) - that's handled by btf_parse_graph_root. After this patch, a bpf program can have a struct bpf_rb_root in a map_value, but not add anything to nor do anything useful with it. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-03bpf: fix typo in header for bpf_perf_prog_read_valueFlorian Lehner
Fix a simple typo in the documentation for bpf_perf_prog_read_value. Signed-off-by: Florian Lehner <dev@der-flo.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230203121439.25884-1-dev@der-flo.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-02-02tools/bpf: Use tab instead of white spaces to sync bpf.hTiezhu Yang
Just silence the following build warning: Warning: Kernel ABI header at 'tools/include/uapi/linux/bpf.h' differs from latest version at 'include/uapi/linux/bpf.h' Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Link: https://lore.kernel.org/r/1675319486-27744-2-git-send-email-yangtiezhu@loongson.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-01-23bpf: Introduce device-bound XDP programsStanislav Fomichev
New flag BPF_F_XDP_DEV_BOUND_ONLY plus all the infra to have a way to associate a netdev with a BPF program at load time. netdevsim checks are dropped in favor of generic check in dev_xdp_attach. Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230119221536.3349901-6-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-15bpf: Add ipip6 and ip6ip decap support for bpf_skb_adjust_room()Ziyang Xuan
Add ipip6 and ip6ip decap support for bpf_skb_adjust_room(). Main use case is for using cls_bpf on ingress hook to decapsulate IPv4 over IPv6 and IPv6 over IPv4 tunnel packets. Add two new flags BPF_F_ADJ_ROOM_DECAP_L3_IPV{4,6} to indicate the new IP header version after decapsulating the outer IP header. Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/b268ec7f0ff9431f4f43b1b40ab856ebb28cb4e1.1673574419.git.william.xuanziyang@huawei.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-12-19bpf: Add flag BPF_F_NO_TUNNEL_KEY to bpf_skb_set_tunnel_key()Christian Ehrig
This patch allows to remove TUNNEL_KEY from the tunnel flags bitmap when using bpf_skb_set_tunnel_key by providing a BPF_F_NO_TUNNEL_KEY flag. On egress, the resulting tunnel header will not contain a tunnel key if the protocol and implementation supports it. At the moment bpf_tunnel_key wants a user to specify a numeric tunnel key. This will wrap the inner packet into a tunnel header with the key bit and value set accordingly. This is problematic when using a tunnel protocol that supports optional tunnel keys and a receiving tunnel device that is not expecting packets with the key bit set. The receiver won't decapsulate and drop the packet. RFC 2890 and RFC 2784 GRE tunnels are examples where this flag is useful. It allows for generating packets, that can be decapsulated by a GRE tunnel device not operating in collect metadata mode or not expecting the key bit set. Signed-off-by: Christian Ehrig <cehrig@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221218051734.31411-1-cehrig@cloudflare.com
2022-12-08bpf: Rework process_dynptr_funcKumar Kartikeya Dwivedi
Recently, user ringbuf support introduced a PTR_TO_DYNPTR register type for use in callback state, because in case of user ringbuf helpers, there is no dynptr on the stack that is passed into the callback. To reflect such a state, a special register type was created. However, some checks have been bypassed incorrectly during the addition of this feature. First, for arg_type with MEM_UNINIT flag which initialize a dynptr, they must be rejected for such register type. Secondly, in the future, there are plans to add dynptr helpers that operate on the dynptr itself and may change its offset and other properties. In all of these cases, PTR_TO_DYNPTR shouldn't be allowed to be passed to such helpers, however the current code simply returns 0. The rejection for helpers that release the dynptr is already handled. For fixing this, we take a step back and rework existing code in a way that will allow fitting in all classes of helpers and have a coherent model for dealing with the variety of use cases in which dynptr is used. First, for ARG_PTR_TO_DYNPTR, it can either be set alone or together with a DYNPTR_TYPE_* constant that denotes the only type it accepts. Next, helpers which initialize a dynptr use MEM_UNINIT to indicate this fact. To make the distinction clear, use MEM_RDONLY flag to indicate that the helper only operates on the memory pointed to by the dynptr, not the dynptr itself. In C parlance, it would be equivalent to taking the dynptr as a point to const argument. When either of these flags are not present, the helper is allowed to mutate both the dynptr itself and also the memory it points to. Currently, the read only status of the memory is not tracked in the dynptr, but it would be trivial to add this support inside dynptr state of the register. With these changes and renaming PTR_TO_DYNPTR to CONST_PTR_TO_DYNPTR to better reflect its usage, it can no longer be passed to helpers that initialize a dynptr, i.e. bpf_dynptr_from_mem, bpf_ringbuf_reserve_dynptr. A note to reviewers is that in code that does mark_stack_slots_dynptr, and unmark_stack_slots_dynptr, we implicitly rely on the fact that PTR_TO_STACK reg is the only case that can reach that code path, as one cannot pass CONST_PTR_TO_DYNPTR to helpers that don't set MEM_RDONLY. In both cases such helpers won't be setting that flag. The next patch will add a couple of selftest cases to make sure this doesn't break. Fixes: 205715673844 ("bpf: Add bpf_user_ringbuf_drain() helper") Acked-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221207204141.308952-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-23bpf: Update bpf_{g,s}etsockopt() documentationJi Rongfeng
* append missing optnames to the end * simplify bpf_getsockopt()'s doc Signed-off-by: Ji Rongfeng <SikoJobs@outlook.com> Link: https://lore.kernel.org/r/DU0P192MB15479B86200B1216EC90E162D6099@DU0P192MB1547.EURP192.PROD.OUTLOOK.COM Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-11-14bpf: Support bpf_list_head in map valuesKumar Kartikeya Dwivedi
Add the support on the map side to parse, recognize, verify, and build metadata table for a new special field of the type struct bpf_list_head. To parameterize the bpf_list_head for a certain value type and the list_node member it will accept in that value type, we use BTF declaration tags. The definition of bpf_list_head in a map value will be done as follows: struct foo { struct bpf_list_node node; int data; }; struct map_value { struct bpf_list_head head __contains(foo, node); }; Then, the bpf_list_head only allows adding to the list 'head' using the bpf_list_node 'node' for the type struct foo. The 'contains' annotation is a BTF declaration tag composed of four parts, "contains:name:node" where the name is then used to look up the type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during the lookup. The node defines name of the member in this type that has the type struct bpf_list_node, which is actually used for linking into the linked list. For now, 'kind' part is hardcoded as struct. This allows building intrusive linked lists in BPF, using container_of to obtain pointer to entry, while being completely type safe from the perspective of the verifier. The verifier knows exactly the type of the nodes, and knows that list helpers return that type at some fixed offset where the bpf_list_node member used for this list exists. The verifier also uses this information to disallow adding types that are not accepted by a certain list. For now, no elements can be added to such lists. Support for that is coming in future patches, hence draining and freeing items is done with a TODO that will be resolved in a future patch. Note that the bpf_list_head_free function moves the list out to a local variable under the lock and releases it, doing the actual draining of the list items outside the lock. While this helps with not holding the lock for too long pessimizing other concurrent list operations, it is also necessary for deadlock prevention: unless every function called in the critical section would be notrace, a fentry/fexit program could attach and call bpf_map_update_elem again on the map, leading to the same lock being acquired if the key matches and lead to a deadlock. While this requires some special effort on part of the BPF programmer to trigger and is highly unlikely to occur in practice, it is always better if we can avoid such a condition. While notrace would prevent this, doing the draining outside the lock has advantages of its own, hence it is used to also fix the deadlock related problem. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-11bpf: Add hwtstamp field for the sockops progMartin KaFai Lau
The bpf-tc prog has already been able to access the skb_hwtstamps(skb)->hwtstamp. This patch extends the same hwtstamp access to the sockops prog. In sockops, the skb is also available to the bpf prog during the BPF_SOCK_OPS_PARSE_HDR_OPT_CB event. There is a use case that the hwtstamp will be useful to the sockops prog to better measure the one-way-delay when the sender has put the tx timestamp in the tcp header option. Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221107230420.4192307-2-martin.lau@linux.dev
2022-10-25bpf: Implement cgroup storage available to non-cgroup-attached bpf progsYonghong Song
Similar to sk/inode/task storage, implement similar cgroup local storage. There already exists a local storage implementation for cgroup-attached bpf programs. See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper bpf_get_local_storage(). But there are use cases such that non-cgroup attached bpf progs wants to access cgroup local storage data. For example, tc egress prog has access to sk and cgroup. It is possible to use sk local storage to emulate cgroup local storage by storing data in socket. But this is a waste as it could be lots of sockets belonging to a particular cgroup. Alternatively, a separate map can be created with cgroup id as the key. But this will introduce additional overhead to manipulate the new map. A cgroup local storage, similar to existing sk/inode/task storage, should help for this use case. The life-cycle of storage is managed with the life-cycle of the cgroup struct. i.e. the storage is destroyed along with the owning cgroup with a call to bpf_cgrp_storage_free() when cgroup itself is deleted. The userspace map operations can be done by using a cgroup fd as a key passed to the lookup, update and delete operations. Typically, the following code is used to get the current cgroup: struct task_struct *task = bpf_get_current_task_btf(); ... task->cgroups->dfl_cgrp ... and in structure task_struct definition: struct task_struct { .... struct css_set __rcu *cgroups; .... } With sleepable program, accessing task->cgroups is not protected by rcu_read_lock. So the current implementation only supports non-sleepable program and supporting sleepable program will be the next step together with adding rcu_read_lock protection for rcu tagged structures. Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used for cgroup storage available to non-cgroup-attached bpf programs. The old cgroup storage supports bpf_get_local_storage() helper to get the cgroup data. The new cgroup storage helper bpf_cgrp_storage_get() can provide similar functionality. While old cgroup storage pre-allocates storage memory, the new mechanism can also pre-allocate with a user space bpf_map_update_elem() call to avoid potential run-time memory allocation failure. Therefore, the new cgroup storage can provide all functionality w.r.t. the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can be deprecated since the new one can provide the same functionality. Acked-by: David Vernet <void@manifault.com> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-06bpf: explicitly define BPF_FUNC_xxx integer valuesAndrii Nakryiko
Historically enum bpf_func_id's BPF_FUNC_xxx enumerators relied on implicit sequential values being assigned by compiler. This is convenient, as new BPF helpers are always added at the very end, but it also has its downsides, some of them being: - with over 200 helpers now it's very hard to know what's each helper's ID, which is often important to know when working with BPF assembly (e.g., by dumping raw bpf assembly instructions with llvm-objdump -d command). it's possible to work around this by looking into vmlinux.h, dumping /sys/btf/kernel/vmlinux, looking at libbpf-provided bpf_helper_defs.h, etc. But it always feels like an unnecessary step and one should be able to quickly figure this out from UAPI header. - when backporting and cherry-picking only some BPF helpers onto older kernels it's important to be able to skip some enum values for helpers that weren't backported, but preserve absolute integer IDs to keep BPF helper IDs stable so that BPF programs stay portable across upstream and backported kernels. While neither problem is insurmountable, they come up frequently enough and are annoying enough to warrant improving the situation. And for the backporting the problem can easily go unnoticed for a while, especially if backport is done with people not very familiar with BPF subsystem overall. Anyways, it's easy to fix this by making sure that __BPF_FUNC_MAPPER macro provides explicit helper IDs. Unfortunately that would potentially break existing users that use UAPI-exposed __BPF_FUNC_MAPPER and are expected to pass macro that accepts only symbolic helper identifier (e.g., map_lookup_elem for bpf_map_lookup_elem() helper). As such, we need to introduce a new macro (___BPF_FUNC_MAPPER) which would specify both identifier and integer ID, but in such a way as to allow existing __BPF_FUNC_MAPPER be expressed in terms of new ___BPF_FUNC_MAPPER macro. And that's what this patch is doing. To avoid duplication and allow __BPF_FUNC_MAPPER stay *exactly* the same, ___BPF_FUNC_MAPPER accepts arbitrary "context" arguments, which can be used to pass any extra macros, arguments, and whatnot. In our case we use this to pass original user-provided macro that expects single argument and __BPF_FUNC_MAPPER is using it's own three-argument __BPF_FUNC_MAPPER_APPLY intermediate macro to impedance-match new and old "callback" macros. Once we resolve this, we use new ___BPF_FUNC_MAPPER to define enum bpf_func_id with explicit values. The other users of __BPF_FUNC_MAPPER in kernel (namely in kernel/bpf/disasm.c) are kept exactly the same both as demonstration that backwards compat works, but also to avoid unnecessary code churn. Note that new ___BPF_FUNC_MAPPER() doesn't forcefully insert comma between values, as that might not be appropriate in all possible cases where ___BPF_FUNC_MAPPER might be used by users. This doesn't reduce usability, as it's trivial to insert that comma inside "callback" macro. To validate all the manually specified IDs are exactly right, we used BTF to compare before and after values: $ bpftool btf dump file ~/linux-build/default/vmlinux | rg bpf_func_id -A 211 > after.txt $ git stash # stach UAPI changes $ make -j90 ... re-building kernel without UAPI changes ... $ bpftool btf dump file ~/linux-build/default/vmlinux | rg bpf_func_id -A 211 > before.txt $ diff -u before.txt after.txt --- before.txt 2022-10-05 10:48:18.119195916 -0700 +++ after.txt 2022-10-05 10:46:49.446615025 -0700 @@ -1,4 +1,4 @@ -[14576] ENUM 'bpf_func_id' encoding=UNSIGNED size=4 vlen=211 +[9560] ENUM 'bpf_func_id' encoding=UNSIGNED size=4 vlen=211 'BPF_FUNC_unspec' val=0 'BPF_FUNC_map_lookup_elem' val=1 'BPF_FUNC_map_update_elem' val=2 As can be seen from diff above, the only thing that changed was resulting BTF type ID of ENUM bpf_func_id, not any of the enumerators, their names or integer values. The only other place that needed fixing was scripts/bpf_doc.py used to generate man pages and bpf_helper_defs.h header for libbpf and selftests. That script is tightly-coupled to exact shape of ___BPF_FUNC_MAPPER macro definition, so had to be trivially adapted. Cc: Quentin Monnet <quentin@isovalent.com> Reported-by: Andrea Terzolo <andrea.terzolo@polito.it> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <quentin@isovalent.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221006042452.2089843-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in the left-over fixes before the net-next pull-request. Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c ae3ed15da588 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear") 9d8cb4c096ab ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc") https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/ kernel/bpf/helpers.c 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF") 5679ff2f138f ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF") 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types") https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28bpf: Handle bpf_link_info for the parameterized task BPF iterators.Kui-Feng Lee
Add new fields to bpf_link_info that users can query it through bpf_obj_get_info_by_fd(). Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-3-kuifeng@fb.com
2022-09-28bpf: Parameterize task iterators.Kui-Feng Lee
Allow creating an iterator that loops through resources of one thread/process. People could only create iterators to loop through all resources of files, vma, and tasks in the system, even though they were interested in only the resources of a specific task or process. Passing the additional parameters, people can now create an iterator to go through all resources or only the resources of a task. Signed-off-by: Kui-Feng Lee <kuifeng@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20220926184957.208194-2-kuifeng@fb.com
2022-09-26bpf: Return value in kprobe get_func_ip only for entry addressJiri Olsa
Changing return value of kprobe's version of bpf_get_func_ip to return zero if the attach address is not on the function's entry point. For kprobes attached in the middle of the function we can't easily get to the function address especially now with the CONFIG_X86_KERNEL_IBT support. If user cares about current IP for kprobes attached within the function body, they can get it with PT_REGS_IP(ctx). Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20220926153340.1621984-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-21bpf: Add bpf_user_ringbuf_drain() helperDavid Vernet
In a prior change, we added a new BPF_MAP_TYPE_USER_RINGBUF map type which will allow user-space applications to publish messages to a ring buffer that is consumed by a BPF program in kernel-space. In order for this map-type to be useful, it will require a BPF helper function that BPF programs can invoke to drain samples from the ring buffer, and invoke callbacks on those samples. This change adds that capability via a new BPF helper function: bpf_user_ringbuf_drain(struct bpf_map *map, void *callback_fn, void *ctx, u64 flags) BPF programs may invoke this function to run callback_fn() on a series of samples in the ring buffer. callback_fn() has the following signature: long callback_fn(struct bpf_dynptr *dynptr, void *context); Samples are provided to the callback in the form of struct bpf_dynptr *'s, which the program can read using BPF helper functions for querying struct bpf_dynptr's. In order to support bpf_ringbuf_drain(), a new PTR_TO_DYNPTR register type is added to the verifier to reflect a dynptr that was allocated by a helper function and passed to a BPF program. Unlike PTR_TO_STACK dynptrs which are allocated on the stack by a BPF program, PTR_TO_DYNPTR dynptrs need not use reference tracking, as the BPF helper is trusted to properly free the dynptr before returning. The verifier currently only supports PTR_TO_DYNPTR registers that are also DYNPTR_TYPE_LOCAL. Note that while the corresponding user-space libbpf logic will be added in a subsequent patch, this patch does contain an implementation of the .map_poll() callback for BPF_MAP_TYPE_USER_RINGBUF maps. This .map_poll() callback guarantees that an epoll-waiting user-space producer will receive at least one event notification whenever at least one sample is drained in an invocation of bpf_user_ringbuf_drain(), provided that the function is not invoked with the BPF_RB_NO_WAKEUP flag. If the BPF_RB_FORCE_WAKEUP flag is provided, a wakeup notification is sent even if no sample was drained. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220920000100.477320-3-void@manifault.com
2022-09-21bpf: Define new BPF_MAP_TYPE_USER_RINGBUF map typeDavid Vernet
We want to support a ringbuf map type where samples are published from user-space, to be consumed by BPF programs. BPF currently supports a kernel -> user-space circular ring buffer via the BPF_MAP_TYPE_RINGBUF map type. We'll need to define a new map type for user-space -> kernel, as none of the helpers exported for BPF_MAP_TYPE_RINGBUF will apply to a user-space producer ring buffer, and we'll want to add one or more helper functions that would not apply for a kernel-producer ring buffer. This patch therefore adds a new BPF_MAP_TYPE_USER_RINGBUF map type definition. The map type is useless in its current form, as there is no way to access or use it for anything until we one or more BPF helpers. A follow-on patch will therefore add a new helper function that allows BPF programs to run callbacks on samples that are published to the ring buffer. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220920000100.477320-2-void@manifault.com
2022-09-21bpf, cgroup: Reject prog_attach_flags array when effective queryPu Lehui
Attach flags is only valid for attached progs of this layer cgroup, but not for effective progs. For querying with EFFECTIVE flags, exporting attach flags does not make sense. So when effective query, we reject prog_attach_flags array and don't need to populate it. Also we limit attach_flags to output 0 during effective query. Fixes: b79c9fc9551b ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP") Signed-off-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/r/20220921104604.2340580-2-pulehui@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-09-06bpf: Update descriptions for helpers bpf_get_func_arg[_cnt]()Yonghong Song
Now instead of the number of arguments, the number of registers holding argument values are stored in trampoline. Update the description of bpf_get_func_arg[_cnt]() helpers. Previous programs without struct arguments should continue to work as usual. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20220831152657.2078805-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-02bpf: Support getting tunnel flagsShmulik Ladkani
Existing 'bpf_skb_get_tunnel_key' extracts various tunnel parameters (id, ttl, tos, local and remote) but does not expose ip_tunnel_info's tun_flags to the BPF program. It makes sense to expose tun_flags to the BPF program. Assume for example multiple GRE tunnels maintained on a single GRE interface in collect_md mode. The program expects origins to initiate over GRE, however different origins use different GRE characteristics (e.g. some prefer to use GRE checksum, some do not; some pass a GRE key, some do not, etc..). A BPF program getting tun_flags can therefore remember the relevant flags (e.g. TUNNEL_CSUM, TUNNEL_SEQ...) for each initiating remote. In the reply path, the program can use 'bpf_skb_set_tunnel_key' in order to correctly reply to the remote, using similar characteristics, based on the stored tunnel flags. Introduce BPF_F_TUNINFO_FLAGS flag for bpf_skb_get_tunnel_key. If specified, 'bpf_tunnel_key->tunnel_flags' is set with the tun_flags. Decided to use the existing unused 'tunnel_ext' as the storage for the 'tunnel_flags' in order to avoid changing bpf_tunnel_key's layout. Also, the following has been considered during the design: 1. Convert the "interesting" internal TUNNEL_xxx flags back to BPF_F_yyy and place into the new 'tunnel_flags' field. This has 2 drawbacks: - The BPF_F_yyy flags are from *set_tunnel_key* enumeration space, e.g. BPF_F_ZERO_CSUM_TX. It is awkward that it is "returned" into tunnel_flags from a *get_tunnel_key* call. - Not all "interesting" TUNNEL_xxx flags can be mapped to existing BPF_F_yyy flags, and it doesn't make sense to create new BPF_F_yyy flags just for purposes of the returned tunnel_flags. 2. Place key.tun_flags into 'tunnel_flags' but mask them, keeping only "interesting" flags. That's ok, but the drawback is that what's "interesting" for my usecase might be limiting for other usecases. Therefore I decided to expose what's in key.tun_flags *as is*, which seems most flexible. The BPF user can just choose to ignore bits he's not interested in. The TUNNEL_xxx are also UAPI, so no harm exposing them back in the get_tunnel_key call. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220831144010.174110-1-shmulik.ladkani@gmail.com
2022-08-26bpf: Fix a few typos in BPF helpers documentationQuentin Monnet
Address a few typos in the documentation for the BPF helper functions. They were reported by Jakub [0], who ran spell checkers on the generated man page [1]. [0] https://lore.kernel.org/linux-man/d22dcd47-023c-8f52-d369-7b5308e6c842@gmail.com/T/#mb02e7d4b7fb61d98fa914c77b581184e9a9537af [1] https://lore.kernel.org/linux-man/eb6a1e41-c48e-ac45-5154-ac57a2c76108@gmail.com/T/#m4a8d1b003616928013ffcd1450437309ab652f9f v3: Do not copy unrelated (and breaking) elements to tools/ header v2: Turn a ',' into a ';' Reported-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220825220806.107143-1-quentin@isovalent.com
2022-08-25bpf: Add CGROUP prefix to cgroup_iter_orderHao Luo
bpf_cgroup_iter_order is globally visible but the entries do not have CGROUP prefix. As requested by Andrii, put a CGROUP in the names in bpf_cgroup_iter_order. This patch fixes two previous commits: one introduced the API and the other uses the API in bpf selftest (that is, the selftest cgroup_hierarchical_stats). I tested this patch via the following command: test_progs -t cgroup,iter,btf_dump Fixes: d4ccaf58a847 ("bpf: Introduce cgroup iter") Fixes: 88886309d2e8 ("selftests/bpf: add a selftest for cgroup hierarchical stats collection") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220825223936.1865810-1-haoluo@google.com Signed-off-by: Martin KaFai Lau <kafai@fb.com>
2022-08-25bpf: Introduce cgroup iterHao Luo
Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes: - walking a cgroup's descendants in pre-order. - walking a cgroup's descendants in post-order. - walking a cgroup's ancestors. - process only the given cgroup. When attaching cgroup_iter, one can set a cgroup to the iter_link created from attaching. This cgroup is passed as a file descriptor or cgroup id and serves as the starting point of the walk. If no cgroup is specified, the starting point will be the root cgroup v2. For walking descendants, one can specify the order: either pre-order or post-order. For walking ancestors, the walk starts at the specified cgroup and ends at the root. One can also terminate the walk early by returning 1 from the iter program. Note that because walking cgroup hierarchy holds cgroup_mutex, the iter program is called with cgroup_mutex held. Currently only one session is supported, which means, depending on the volume of data bpf program intends to send to user space, the number of cgroups that can be walked is limited. For example, given the current buffer size is 8 * PAGE_SIZE, if the program sends 64B data for each cgroup, assuming PAGE_SIZE is 4kb, the total number of cgroups that can be walked is 512. This is a limitation of cgroup_iter. If the output data is larger than the kernel buffer size, after all data in the kernel buffer is consumed by user space, the subsequent read() syscall will signal EOPNOTSUPP. In order to work around, the user may have to update their program to reduce the volume of data sent to output. For example, skip some uninteresting cgroups. In future, we may extend bpf_iter flags to allow customizing buffer size. Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Hao Luo <haoluo@google.com> Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-23bpf: update bpf_{g,s}et_retval documentationStanislav Fomichev
* replace 'syscall' with 'upper layers', still mention that it's being exported via syscall errno * describe what happens in set_retval(-EPERM) + return 1 * describe what happens with bind's 'return 3' Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20220823222555.523590-5-sdf@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-08-23bpf, flow_dissector: Introduce BPF_FLOW_DISSECTOR_CONTINUE retcode for bpf progsShmulik Ladkani
Currently, attaching BPF_PROG_TYPE_FLOW_DISSECTOR programs completely replaces the flow-dissector logic with custom dissection logic. This forces implementors to write programs that handle dissection for any flows expected in the namespace. It makes sense for flow-dissector BPF programs to just augment the dissector with custom logic (e.g. dissecting certain flows or custom protocols), while enjoying the broad capabilities of the standard dissector for any other traffic. Introduce BPF_FLOW_DISSECTOR_CONTINUE retcode. Flow-dissector BPF programs may return this to indicate no dissection was made, and fallback to the standard dissector is requested. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Stanislav Fomichev <sdf@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220821113519.116765-3-shmulik.ladkani@gmail.com