path: root/kernel/bpf
2021-06-02  bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks  (Daniel Borkmann)

Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") added an implementation of the locked_down LSM hook to SELinux, with the aim to restrict which domains are allowed to perform operations that would breach lockdown. This indirectly also gets the audit subsystem involved to report events. The latter is problematic, as reported by Ondrej and Serhei, since it can bring down the whole system via audit:

1) The audit events that are triggered due to calls to security_locked_down() can OOM kill a machine, see below details [0].

2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit() when trying to wake up kauditd, for example, when using trace_sched_switch() tracepoint, see details in [1]. Triggering this was not via some hypothetical corner case, but with existing tools like runqlat & runqslower from bcc, for example, which make use of this tracepoint. Rough call sequence goes like:

    rq_lock(rq) -> -------------------------+
      trace_sched_switch() ->               |
        bpf_prog_xyz() ->                   +-> deadlock
          selinux_lockdown() ->             |
            audit_log_end() ->              |
              wake_up_interruptible() ->    |
                try_to_wake_up() ->         |
                  rq_lock(rq) --------------+

What's worse is that the intention of 59438b46471a to further restrict lockdown settings for specific applications in respect to the global lockdown policy is completely broken for BPF. The SELinux policy rule for the current lockdown check looks something like this:

    allow <who> <who> : lockdown { <reason> };

However, this doesn't match with the 'current' task where the security_locked_down() is executed, example: httpd does a syscall. There is a tracing program attached to the syscall which triggers a BPF program to run, which ends up doing a bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does the permission check against 'current', that is, httpd in this example. httpd has literally zero relation to this tracing program, and it would be nonsensical having to write an SELinux policy rule against httpd to let the tracing helper pass. The policy in this case needs to be against the entity that is installing the BPF program. For example, if bpftrace would generate a histogram of syscall counts by user space application:

    bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

bpftrace would then go and generate a BPF program from this internally. One way of doing it [for the sake of the example] could be to call the bpf_get_current_task() helper and then access current->comm via one of the bpf_probe_read_kernel{,_str}() helpers. So the program itself has nothing to do with httpd or any other random app doing a syscall here. The BPF program _explicitly initiated_ the lockdown check. The allow/deny policy belongs in the context of bpftrace: meaning, you want to grant bpftrace access to use these helpers, but other tracers on the system like my_random_tracer _not_.

Therefore fix all three issues at the same time by taking a completely different approach for the security_locked_down() hook, that is, move the check into the program verification phase where we actually retrieve the BPF func proto. This also reliably gets the task (current) that is trying to install the BPF tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since we're moving this out of the BPF helper's fast-path, which can be called several millions of times per second.

The check is then also in line with other security_locked_down() hooks in the system where the enforcement is performed at open/load time, for example, open_kcore() for /proc/kcore access or module_sig_check() for module signatures, just to pick a few random ones. What's out of scope in the fix as well as in other security_locked_down() hook locations /outside/ of the BPF subsystem is that if the lockdown policy changes on the fly, there is no retrospective action. This requires a different discussion, potentially complex infrastructure, and it's also not clear whether this can be solved generically. Either way, it is out of scope for a suitable stable fix which this one is targeting. Note that the breakage is specifically on 59438b46471a where it started to rely on 'current' as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is in confidentiality mode").

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:

  I starting seeing this with F-34. When I run a container that is traced with BPF to record the syscalls it is doing, auditd is flooded with messages like:

    type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality } for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM" scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 tclass=lockdown permissive=0

  This seems to be leading to auditd running out of space in the backlog buffer and eventually OOMs the machine. [...] auditd running at 99% CPU presumably processing all the messages, eventually I get:

    Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
    Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
    Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
    Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
    Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
    Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
    Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
    Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
    Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
    [...]

[1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/, Serhei Makarov says:

  Upstream kernel 5.11.0-rc7 and later was found to deadlock during a bpf_probe_read_compat() call within a sched_switch tracepoint. The problem is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on ppc64le. Example stack trace:

    [...]
    [  730.868702] stack backtrace:
    [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
    [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    [  730.873278] Call Trace:
    [  730.873770]  dump_stack+0x7f/0xa1
    [  730.874433]  check_noncircular+0xdf/0x100
    [  730.875232]  __lock_acquire+0x1202/0x1e10
    [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
    [  730.876844]  lock_acquire+0xc2/0x3a0
    [  730.877551]  ? __wake_up_common_lock+0x52/0x90
    [  730.878434]  ? lock_acquire+0xc2/0x3a0
    [  730.879186]  ? lock_is_held_type+0xa7/0x120
    [  730.880044]  ? skb_queue_tail+0x1b/0x50
    [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
    [  730.881656]  ? __wake_up_common_lock+0x52/0x90
    [  730.882532]  __wake_up_common_lock+0x52/0x90
    [  730.883375]  audit_log_end+0x5b/0x100
    [  730.884104]  slow_avc_audit+0x69/0x90
    [  730.884836]  avc_has_perm+0x8b/0xb0
    [  730.885532]  selinux_lockdown+0xa5/0xd0
    [  730.886297]  security_locked_down+0x20/0x40
    [  730.887133]  bpf_probe_read_compat+0x66/0xd0
    [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
    [  730.888917]  trace_call_bpf+0xe9/0x240
    [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
    [  730.890579]  perf_trace_sched_switch+0x142/0x180
    [  730.891485]  ? __schedule+0x6d8/0xb20
    [  730.892209]  __schedule+0x6d8/0xb20
    [  730.892899]  schedule+0x5b/0xc0
    [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
    [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
    [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
    [...]

Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Reported-by: Jakub Hrozek <jhrozek@redhat.com>
Reported-by: Serhei Makarov <smakarov@redhat.com>
Reported-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Frank Eigler <fche@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
2021-06-01  bpf, tnums: Provably sound, faster, and more precise algorithm for tnum_mul  (Harishankar Vishwanathan)

This patch introduces a new algorithm for multiplication of tristate numbers (tnums) that is provably sound. It is faster and more precise when compared to the existing method. Like the existing method, this new algorithm follows the long multiplication algorithm. The idea is to generate partial products by multiplying each bit in the multiplier (tnum a) with the multiplicand (tnum b), and adding the partial products after appropriately bit-shifting them. The new algorithm, however, uses just a single loop over the bits of the multiplier (tnum a) and accumulates only the uncertain components of the multiplicand (tnum b) into a mask-only tnum. The following paper explains the algorithm in more detail: https://arxiv.org/abs/2105.05398.

A natural way to construct the tnum product is by performing a tnum addition on all the partial products. This algorithm presents another method of doing this: decompose each partial product into two tnums, consisting of the values and the masks separately. The mask-sum is accumulated within the loop in acc_m. The value-sum tnum is generated using a.value * b.value. The tnum constructed by tnum addition of the value-sum and the mask-sum contains all possible summations of concrete values drawn from the partial product tnums pairwise. We prove this result in the paper.

Our evaluations show that the new algorithm is overall more precise (producing tnums with fewer uncertain components) than the existing method. As an illustrative example, consider the input tnums A and B. The numbers in parentheses correspond to (value;mask):

    A                = 000000x1 (1;2)
    B                = 0010011x (38;1)
    A * B (existing) = xxxxxxxx (0;255)
    A * B (new)      = 0x1xxxxx (32;95)

Importantly, we present a proof of soundness of the new algorithm in the aforementioned paper. Additionally, we show that this new algorithm is empirically faster than the existing method.

Co-developed-by: Matan Shachnai <m.shachnai@rutgers.edu>
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Matan Shachnai <m.shachnai@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@rutgers.edu>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://arxiv.org/abs/2105.05398
Link: https://lore.kernel.org/bpf/20210531020157.7386-1-harishankar.vishwanathan@rutgers.edu
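For reference, a minimal sketch of the single-loop multiplication described above, written in the style of kernel/bpf/tnum.c (the in-tree code may differ in details; TNUM(), tnum_add(), tnum_lshift() and tnum_rshift() are the existing tnum primitives):

    struct tnum tnum_mul(struct tnum a, struct tnum b)
    {
        u64 acc_v = a.value * b.value;  /* value-sum of the partial products */
        struct tnum acc_m = TNUM(0, 0); /* mask-sum, accumulated in the loop */

        while (a.value || a.mask) {
            /* LSB of tnum a is a certain 1: add b's uncertainty */
            if (a.value & 1)
                acc_m = tnum_add(acc_m, TNUM(0, b.mask));
            /* LSB of tnum a is uncertain: the whole partial product is */
            else if (a.mask & 1)
                acc_m = tnum_add(acc_m, TNUM(0, b.value | b.mask));
            /* Note: no case for the LSB being a certain 0 */
            a = tnum_rshift(a, 1);
            b = tnum_lshift(b, 1);
        }
        return tnum_add(TNUM(acc_v, 0), acc_m);
    }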
2021-05-28  bpf, devmap: Remove drops variable from bq_xmit_all()  (Hangbin Liu)

As Colin pointed out, the first drops assignment after declaration will be overwritten by the second drops assignment before being used, which makes it useless. Since the drops variable would only be used once, just remove it and use "cnt - sent" directly in trace_xdp_devmap_xmit().

Fixes: cb261b594b41 ("bpf: Run devmap xdp_prog on flush instead of bulk enqueue")
Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210528024356.24333-1-liuhangbin@gmail.com
2021-05-27  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)

cdc-wdm: s/kill_urbs/poison_urbs/ to fix build

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-26  libbpf: Move BPF_SEQ_PRINTF and BPF_SNPRINTF to bpf_helpers.h  (Florent Revest)

These macros are convenient wrappers around the bpf_seq_printf and bpf_snprintf helpers. They are currently provided by bpf_tracing.h which targets low level tracing primitives. bpf_helpers.h is a better fit. The __bpf_narg and __bpf_apply are needed in both files and provided twice. __bpf_empty isn't used anywhere and is removed from bpf_tracing.h.

Reported-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210526164643.2881368-1-revest@chromium.org
2021-05-26  xdp: Extend xdp_redirect_map with broadcast support  (Hangbin Liu)

This patch adds two flags, BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS, to extend xdp_redirect_map for broadcast support. With BPF_F_BROADCAST the packet will be broadcast to all the interfaces in the map. With BPF_F_EXCLUDE_INGRESS the ingress interface will be excluded when broadcasting.

When getting the devices in a dev hash map via dev_map_hash_get_next_key(), there is a possibility that we fall back to the first key when a device was removed. This will duplicate packets on some interfaces. So just walk the whole buckets to avoid this issue. For the dev array map, we also walk the whole map to find valid interfaces.

Function bpf_clear_redirect_map() was removed in commit ee75aef23afe ("bpf, xdp: Restructure redirect actions"). Add it back as we need to use ri->map again.

Test topology:

    +-------------------+             +-------------------+
    | Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
    +-------------------+             |                   |
                                      |   Host B          |
    +-------------------+             |                   |
    | Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
    +-------------------+             |                   |
                                      |          +------+ |
                                      | veth0 -- | Peer | |
                                      | veth1 -- |      | |
                                      | veth2 -- |  NS  | |
                                      |          +------+ |
                                      +-------------------+

On Host A:

    # pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory): use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing. All the veth peers in the NS have an XDP_DROP program loaded. The forward_map max_entries in xdp_redirect_map_multi is modified to 4.

Testing the performance impact on the regular xdp_redirect path with and without the patch (to check the impact of the additional check for broadcast mode):

    Version          | Test                    | Generic | Native
    5.12 rc4         | redirect_map i40e->i40e |    2.0M |   9.7M
    5.12 rc4         | redirect_map i40e->veth |    1.7M |  11.8M
    5.12 rc4 + patch | redirect_map i40e->i40e |    2.0M |   9.6M
    5.12 rc4 + patch | redirect_map i40e->veth |    1.7M |  11.7M

Testing the performance when cloning packets with the redirect_map_multi test, using a redirect map size of 4, filled with 1-3 devices:

    Version          | Test                                | Generic | Native
    5.12 rc4 + patch | redirect_map multi i40e->veth (x1)  |    1.7M |  11.4M
    5.12 rc4 + patch | redirect_map multi i40e->veth (x2)  |    1.1M |   4.3M
    5.12 rc4 + patch | redirect_map multi i40e->veth (x3)  |    0.8M |   2.6M

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
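A hedged illustration of the new flags from the BPF program side (map name and sizing are assumptions for the example, not part of the patch):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
        __uint(max_entries, 4);
    } forward_map SEC(".maps");

    SEC("xdp")
    int xdp_broadcast(struct xdp_md *ctx)
    {
        /* With BPF_F_BROADCAST the key argument (0 here) is ignored;
         * BPF_F_EXCLUDE_INGRESS skips the receiving interface.
         */
        return bpf_redirect_map(&forward_map, 0,
                                BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
    }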
2021-05-26  bpf: Run devmap xdp_prog on flush instead of bulk enqueue  (Jesper Dangaard Brouer)

This changes the devmap XDP program support to run the program when the bulk queue is flushed instead of before the frame is enqueued. This has a couple of benefits:

- It "sorts" the packets by destination devmap entry, and then runs the same BPF program on all the packets in sequence. This ensures that we keep the XDP program and destination device properties hot in the I-cache.

- It makes the multicast implementation simpler because it can just enqueue packets using bq_enqueue() without having to deal with the devmap program at all.

The drawback is that if the devmap program drops the packet, the enqueue step is redundant. However, arguably this is mostly visible in a micro-benchmark, and with more mixed traffic the I-cache benefit should win out.

The performance impact of just this patch is as follows. Using two 10Gb i40e NICs, redirecting one to another, or into a veth interface which does XDP_DROP on the veth peer. With xdp_redirect_map in samples/bpf, send pkts via pktgen cmd:

    ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

There is about +/- 0.1M deviation for native testing; the performance improved for the base case, but dropped back somewhat with an xdp devmap prog attached:

    Version          | Test                        | Generic | Native | Native + 2nd xdp_prog
    5.12 rc4         | xdp_redirect_map i40e->i40e |    1.9M |   9.6M |  8.4M
    5.12 rc4         | xdp_redirect_map i40e->veth |    1.7M |  11.7M |  9.8M
    5.12 rc4 + patch | xdp_redirect_map i40e->i40e |    1.9M |   9.8M |  8.0M
    5.12 rc4 + patch | xdp_redirect_map i40e->veth |    1.7M |  12.0M |  9.4M

When bq_xmit_all() is called from bq_enqueue(), another packet will always be enqueued immediately after, so clearing dev_rx, xdp_prog and flush_node in bq_xmit_all() is redundant. Move the clear to __dev_flush(), and only check them once in bq_enqueue() since they are all modified together.

This change also has the side effect of extending the lifetime of the RCU-protected xdp_prog that lives inside the devmap entries: instead of just living for the duration of the XDP program invocation, the reference now lives all the way until the bq is flushed. This is safe because the bq flush happens at the end of the NAPI poll loop, so everything happens between a local_bh_disable()/local_bh_enable() pair. However, this is by no means obvious from looking at the call sites; in particular, some drivers have an additional rcu_read_lock() around only the XDP program invocation, which only confuses matters further. Cleaning this up will be done in a separate patch series.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-2-liuhangbin@gmail.com
2021-05-25  bpf: No need to simulate speculative domain for immediates  (Daniel Borkmann)

In 801c6058d14a ("bpf: Fix leakage of uninitialized bpf stack under speculation") we replaced masking logic with direct loads of immediates if the register is a known constant. Given in this case we do not apply any masking, there is also no reason for the operation to be truncated under the speculative domain. Therefore, there is also zero reason for the verifier to branch off and simulate this case; it only needs to do so for unknown but bounded scalars. As a side effect, this also enables a few test cases that were previously rejected due to simulation under zero truncation.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-25  bpf: Fix mask direction swap upon off reg sign change  (Daniel Borkmann)

The masking direction as indicated via mask_to_left is considered to be calculated once and then used to derive pointer limits. Thus, it needs to be placed into bpf_sanitize_info instead, so we can pass it to the sanitize_ptr_alu() call after the pointer move. Piotr noticed a corner case where the off reg causes a masking direction change which then results in an incorrect final aux->alu_limit.

Fixes: 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask")
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-25  bpf: Wrap aux data inside bpf_sanitize_info container  (Daniel Borkmann)

Add a container structure struct bpf_sanitize_info which holds the current aux info, and update call-sites to sanitize_ptr_alu() to pass it in. This is needed for passing in additional state later on.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-25  bpf: Fix BPF_LSM kconfig symbol dependency  (Daniel Borkmann)

Similarly as 6bdacdb48e94 ("bpf: Fix BPF_JIT kconfig symbol dependency") we need to detangle the hard BPF_LSM dependency on NET. This was previously implicit by its dependency on BPF_JIT which itself was dependent on NET (but without any actual/real hard dependency code-wise). Given the latter was lifted, so should be the former as BPF_LSMs could well exist on net-less systems. This therefore also fixes a randconfig build error recently reported by Randy:

    ld: kernel/bpf/bpf_lsm.o: in function `bpf_lsm_func_proto':
    bpf_lsm.c:(.text+0x1a0): undefined reference to `bpf_sk_storage_get_proto'
    ld: bpf_lsm.c:(.text+0x1b8): undefined reference to `bpf_sk_storage_delete_proto'
    [...]

Fixes: b24abcff918a ("bpf, kconfig: Add consolidated menu entry for bpf with core options")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
2021-05-24  bpf: Fix spelling mistakes  (Zhen Lei)

Fix some spelling mistakes in comments:

    aother      ==> another
    Netiher     ==> Neither
    desribe     ==> describe
    intializing ==> initializing
    funciton    ==> function
    wont        ==> won't (and move the word 'the' at the end to the next line)
    accross     ==> across
    pathes      ==> paths
    triggerred  ==> triggered
    excute      ==> execute
    ether       ==> either
    conervative ==> conservative
    convetion   ==> convention
    markes      ==> marks
    interpeter  ==> interpreter

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210525025659.8898-2-thunder.leizhen@huawei.com
2021-05-24  bpf: Add lookup_and_delete_elem support to hashtab  (Denis Salopek)

Extend the existing bpf_map_lookup_and_delete_elem() functionality to hashtab map types, in addition to stacks and queues. Create a new hashtab bpf_map_ops function that does lookup and deletion of the element under the same bucket lock and add the created map_ops to bpf.h.

Signed-off-by: Denis Salopek <denis.salopek@sartura.hr>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/4d18480a3e990ffbf14751ddef0325eed3be2966.1620763117.git.denis.salopek@sartura.hr
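From user space this surfaces through the existing libbpf wrapper; a minimal sketch (map_fd and the key/value layout are assumptions for the example):

    #include <stdio.h>
    #include <bpf/bpf.h>

    /* Atomically read and remove one element from a BPF_MAP_TYPE_HASH map,
     * avoiding the lookup/delete race of two separate syscalls.
     */
    static void pop_entry(int map_fd)
    {
        __u64 key = 42, value;

        if (bpf_map_lookup_and_delete_elem(map_fd, &key, &value) == 0)
            printf("popped value %llu\n", (unsigned long long)value);
    }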
2021-05-20  bpf, offload: Reorder offload callback 'prepare' in verifier  (Yinjun Zhang)

Commit 4976b718c355 ("bpf: Introduce pseudo_btf_id") switched the order of resolve_pseudo_ldimm(), in which some pseudo instructions are rewritten. Thus those rewritten instructions cannot be passed to the driver via the 'prepare' offload callback. Reorder the 'prepare' offload callback to fix it.

Fixes: 4976b718c355 ("bpf: Introduce pseudo_btf_id")
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210520085834.15023-1-simon.horman@netronome.com
2021-05-20  bpf: Avoid using ARRAY_SIZE on an uninitialized pointer  (Florent Revest)

The cppcheck static code analysis reported the following error:

    if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(bufs->tmp_bufs))) {
                                             ^

ARRAY_SIZE is a macro that expands to sizeof expressions, so bufs is not actually dereferenced at runtime, and the code is actually safe. But to keep things tidy, this patch removes the need for a call to ARRAY_SIZE by extracting the size of the array into a macro. Cppcheck should no longer be confused and the code ends up being a bit cleaner.

Fixes: e2d5b2bb769f ("bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20210517092830.1026418-2-revest@chromium.org
2021-05-20  bpf: Clarify a bpf_bprintf_prepare macro  (Florent Revest)

The per-cpu buffers contain bprintf data rather than printf arguments. The macro name and comment were a bit confusing; this rewords them in a clearer way.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20210517092830.1026418-1-revest@chromium.org
2021-05-20  bpf: Fix BPF_JIT kconfig symbol dependency  (Daniel Borkmann)

Randy reported a randconfig build error recently on i386:

    ld: arch/x86/net/bpf_jit_comp32.o: in function `do_jit':
    bpf_jit_comp32.c:(.text+0x28c9): undefined reference to `__bpf_call_base'
    ld: arch/x86/net/bpf_jit_comp32.o: in function `bpf_int_jit_compile':
    bpf_jit_comp32.c:(.text+0x3694): undefined reference to `bpf_jit_blind_constants'
    ld: bpf_jit_comp32.c:(.text+0x3719): undefined reference to `bpf_jit_binary_free'
    ld: bpf_jit_comp32.c:(.text+0x3745): undefined reference to `bpf_jit_binary_alloc'
    ld: bpf_jit_comp32.c:(.text+0x37d3): undefined reference to `bpf_jit_prog_release_other'
    [...]

The cause was that b24abcff918a ("bpf, kconfig: Add consolidated menu entry for bpf with core options") moved BPF_JIT from net/Kconfig into kernel/bpf/Kconfig, and previously BPF_JIT was guarded by an 'if NET'. However, there is no actual dependency on NET; it's just that menuconfig NET selects BPF, and the latter in turn causes kernel/bpf/core.o to be built, which contains the above symbols. Randy's randconfig didn't have NET set, and BPF wasn't either, but BPF_JIT otoh was. Detangle this by making BPF_JIT depend on BPF instead. arm64 was the only arch that pulled in its JIT in net/ via obj-$(CONFIG_NET); all others unconditionally pull this dir in via obj-y. Do the same there, since the CONFIG_NET guard is really useless as we compile the JIT via obj-$(CONFIG_BPF_JIT) += bpf_jit_comp.o anyway.

Fixes: b24abcff918a ("bpf, kconfig: Add consolidated menu entry for bpf with core options")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
2021-05-19  bpf: Make some symbols static  (Pu Lehui)

The sparse tool complains as follows:

    kernel/bpf/syscall.c:4567:29: warning: symbol 'bpf_sys_bpf_proto' was not declared. Should it be static?
    kernel/bpf/syscall.c:4592:29: warning: symbol 'bpf_sys_close_proto' was not declared. Should it be static?

These symbols are not used outside of syscall.c, so mark them static.

Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210519064116.240536-1-pulehui@huawei.com
2021-05-19  bpf: Add bpf_sys_close() helper.  (Alexei Starovoitov)

Add a bpf_sys_close() helper to be used by the syscall/loader program to close intermediate FDs and do other cleanup. Note this helper must never be allowed inside fdget/fdput bracketing.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-11-alexei.starovoitov@gmail.com
2021-05-19  bpf: Add bpf_btf_find_by_name_kind() helper.  (Alexei Starovoitov)

Add new helper:

    long bpf_btf_find_by_name_kind(char *name, int name_sz, u32 kind, int flags)
        Description
            Find BTF type with given name and kind in vmlinux BTF or in module's BTFs.
        Return
            Returns btf_id and btf_obj_fd in lower and upper 32 bits.

It will be used by the loader program to find the btf_id to attach the program to and to find the btf_ids of ksyms.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-10-alexei.starovoitov@gmail.com
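A hedged sketch of decoding the packed return value from a loader-style BPF program (the target name and kind are placeholders, and the assumption here is that name_sz counts the terminating NUL):

    const char name[] = "bpf_fentry_test1";    /* illustrative target */
    long ret;

    ret = bpf_btf_find_by_name_kind((char *)name, sizeof(name),
                                    BTF_KIND_FUNC, 0);
    if (ret < 0)
        return ret;                            /* not found or error */

    int btf_id     = (int)(__u32)ret;          /* lower 32 bits */
    int btf_obj_fd = (int)(ret >> 32);         /* upper 32 bits, 0 for vmlinux BTF */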
2021-05-19  bpf: Introduce fd_idx  (Alexei Starovoitov)

The typical program loading sequence involves creating bpf maps and applying map FDs into bpf instructions in various places in the bpf program. This job is done by libbpf, which uses compiler generated ELF relocations to patch certain instructions after maps are created and BTFs are loaded. The goal of fd_idx is to allow bpf instructions to stay immutable after compilation. At load time libbpf would still create maps as usual, but it wouldn't need to patch instructions. It would store map_fds into __u32 fd_array[] and pass that pointer to sys_bpf(BPF_PROG_LOAD).

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-9-alexei.starovoitov@gmail.com
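Conceptually, a sketch under the assumption that BPF_PSEUDO_MAP_IDX marks an ld_imm64 whose imm is an index into fd_array rather than an embedded map FD (exact macros and encoding follow the patch; treat as illustrative):

    /* Instruction refers to fd_array[2] instead of embedding a map FD */
    struct bpf_insn insns[] = {
        BPF_LD_IMM64_RAW(BPF_REG_1, BPF_PSEUDO_MAP_IDX, 2),
        /* ... rest of the program ... */
    };

    int fd_array[] = { map_fd0, map_fd1, map_fd2 };

    union bpf_attr attr = {};
    attr.fd_array = (__u64)(unsigned long)fd_array;  /* new attr field */
    /* ... fill prog_type, insns, license, then call sys_bpf(BPF_PROG_LOAD) */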
2021-05-19  bpf: Make btf_load command to be bpfptr_t compatible.  (Alexei Starovoitov)

Similar to prog_load, make the btf_load command available to the bpf_prog_type_syscall program.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-7-alexei.starovoitov@gmail.com
2021-05-19  bpf: Prepare bpf syscall to be used from kernel and user space.  (Alexei Starovoitov)

With the help of bpfptr_t, prepare the relevant bpf syscall commands to be usable from both kernel and user space.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-4-alexei.starovoitov@gmail.com
2021-05-19  bpf: Introduce bpf_sys_bpf() helper and program type.  (Alexei Starovoitov)

Add placeholders for the bpf_sys_bpf() helper and a new program type. Make sure to check that expected_attach_type is zero for future extensibility. Allow tracing helper functions to be used in this program type, since they will only execute from user context via bpf_prog_test_run.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210514003623.28033-2-alexei.starovoitov@gmail.com
2021-05-11  bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers  (Florent Revest)

The bpf_seq_printf, bpf_trace_printk and bpf_snprintf helpers share one per-cpu buffer that they use to store temporary data (arguments to bprintf). They "get" that buffer with try_get_fmt_tmp_buf and "put" it at the end of their scope with bpf_bprintf_cleanup. If one of these helpers gets called within the scope of another, for example when a first bpf program uses bpf_trace_printk, which calls raw_spin_lock_irqsave, which is traced by another bpf program that calls bpf_snprintf, then the second "get" fails. Essentially, these helpers are not re-entrant; they would return -EBUSY and print a warning message once.

This patch triples the number of bprintf buffers to allow three levels of nesting. This is very similar to what was done for tracepoints in 9594dc3c7e7 ("bpf: fix nested bpf tracepoints with per-cpu data").

Fixes: d9c9e4db186a ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
Reported-by: syzbot+63122d0bc347f18c1884@syzkaller.appspotmail.com
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210511081054.2125874-1-revest@chromium.org
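A minimal sketch of the nesting scheme (macro and symbol names are illustrative; the in-tree code differs in details such as the buffer-length macro):

    #define MAX_BPRINTF_BUF_LEN     512
    #define MAX_BPRINTF_NEST_LEVEL  3

    struct bpf_bprintf_buffers {
        char tmp_bufs[MAX_BPRINTF_NEST_LEVEL][MAX_BPRINTF_BUF_LEN];
    };
    static DEFINE_PER_CPU(struct bpf_bprintf_buffers, bpf_bprintf_bufs);
    static DEFINE_PER_CPU(int, bpf_bprintf_nest_level);

    static int try_get_fmt_tmp_buf(char **tmp_buf)
    {
        struct bpf_bprintf_buffers *bufs;
        int nest_level;

        preempt_disable();      /* re-enabled in bpf_bprintf_cleanup() */
        nest_level = this_cpu_inc_return(bpf_bprintf_nest_level);
        if (WARN_ON_ONCE(nest_level > MAX_BPRINTF_NEST_LEVEL)) {
            this_cpu_dec(bpf_bprintf_nest_level);
            preempt_enable();
            return -EBUSY;      /* more than three nested printf helpers */
        }
        bufs = this_cpu_ptr(&bpf_bprintf_bufs);
        *tmp_buf = bufs->tmp_bufs[nest_level - 1];
        return 0;
    }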
2021-05-11  bpf: Add deny list of btf ids check for tracing programs  (Jiri Olsa)

The recursion check in __bpf_prog_enter and __bpf_prog_exit leaves some (not inlined) functions unprotected:

In __bpf_prog_enter:
- migrate_disable is called before prog->active is checked

In __bpf_prog_exit:
- migrate_enable, rcu_read_unlock_strict are called after prog->active is decreased

When attaching a trampoline to them we get a panic like:

    traps: PANIC: double fault, error_code: 0x0
    double fault: 0000 [#1] SMP PTI
    RIP: 0010:__bpf_prog_enter+0x4/0x50
    ...
    Call Trace:
     <IRQ>
     bpf_trampoline_6442466513_0+0x18/0x1000
     migrate_disable+0x5/0x50
     __bpf_prog_enter+0x9/0x50
     bpf_trampoline_6442466513_0+0x18/0x1000
     migrate_disable+0x5/0x50
     __bpf_prog_enter+0x9/0x50
     bpf_trampoline_6442466513_0+0x18/0x1000
     migrate_disable+0x5/0x50
     __bpf_prog_enter+0x9/0x50
     bpf_trampoline_6442466513_0+0x18/0x1000
     migrate_disable+0x5/0x50
     ...

Fix this by adding a deny list of btf ids for tracing programs and checking the btf id during program verification. Add the above functions to this list.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210429114712.43783-1-jolsa@kernel.org
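The deny list itself is a BTF ID set consulted at attach-check time; a sketch following the patch (config guards and naming as understood from the description, treat details as illustrative):

    BTF_SET_START(btf_id_deny)
    BTF_ID_UNUSED
    #ifdef CONFIG_SMP
    BTF_ID(func, migrate_disable)
    BTF_ID(func, migrate_enable)
    #endif
    #if !defined CONFIG_PREEMPT_RCU && !defined CONFIG_TINY_RCU
    BTF_ID(func, rcu_read_unlock_strict)
    #endif
    BTF_SET_END(btf_id_deny)

    /* ... and in check_attach_btf_id(): */
    else if (prog->type == BPF_PROG_TYPE_TRACING &&
             btf_id_set_contains(&btf_id_deny, btf_id))
        return -EINVAL;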
2021-05-11  bpf: Add kconfig knob for disabling unpriv bpf by default  (Daniel Borkmann)

Add a kconfig knob which allows for unprivileged bpf to be disabled by default. If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to a value of 2. This still allows a transition of 2 -> {0,1} through an admin. Similarly, this also still keeps 1 -> {1} behavior intact, so that once set to permanently disabled, it cannot be undone aside from a reboot. We've also added extra2 with a max of 2 for the procfs handler, so that an admin still has a chance to toggle between 0 <-> 2. Either way, as an additional alternative, applications can make use of CAP_BPF that we added a while ago.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
2021-05-11  bpf, kconfig: Add consolidated menu entry for bpf with core options  (Daniel Borkmann)

Right now, all core BPF related options are scattered in different Kconfig locations mainly due to historic reasons. Moving forward, lets add a proper subsystem entry under ...

    General setup --->
        BPF subsystem --->

... in order to have all knobs in a single location and thus ease BPF related configuration. Networking related bits such as sockmap are out of scope for the general setup and therefore better suited to remain in net/Kconfig.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/f23f58765a4d59244ebd8037da7b6a6b2fb58446.1620765074.git.daniel@iogearbox.net
2021-05-11  bpf: Prevent writable memory-mapping of read-only ringbuf pages  (Andrii Nakryiko)

Only the very first page of the BPF ringbuf, which contains the consumer position counter, is supposed to be mapped as writable by user-space. The producer position is read-only and can be modified only by kernel code. BPF ringbuf data pages are read-only as well and are not meant to be modified by user code, to maintain the integrity of per-record headers. This patch allows mapping only the consumer position page as writable; everything else is restricted to be read-only. remap_vmalloc_range() internally adds VM_DONTEXPAND, so all the established memory mappings can't be extended, which prevents any future violations through mremap()'ing.

Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: Ryota Shiga (Flatt Security)
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
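A sketch of the tightened mmap handler (close to, but not necessarily verbatim, the fix):

    static int ringbuf_map_mmap(struct bpf_map *map, struct vm_area_struct *vma)
    {
        struct bpf_ringbuf_map *rb_map;

        rb_map = container_of(map, struct bpf_ringbuf_map, map);

        if (vma->vm_flags & VM_WRITE) {
            /* allow writable mapping for the consumer_pos page only */
            if (vma->vm_pgoff != 0 || vma->vm_end - vma->vm_start != PAGE_SIZE)
                return -EPERM;
        } else {
            vma->vm_flags &= ~VM_MAYWRITE;  /* no mprotect() back to writable */
        }
        /* remap_vmalloc_range() checks size and offset constraints */
        return remap_vmalloc_range(vma, rb_map->rb,
                                   vma->vm_pgoff + RINGBUF_PGOFF);
    }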
2021-05-11  bpf, ringbuf: Deny reserve of buffers larger than ringbuf  (Thadeu Lima de Souza Cascardo)

A BPF program might try to reserve a buffer larger than the ringbuf size. If the consumer pointer is way ahead of the producer, that reservation would succeed, allowing the BPF program to read or write outside the allocated ringbuf area.

Reported-by: Ryota Shiga (Flatt Security)
Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
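The fix amounts to an up-front size guard in the reserve path; a hedged sketch (the helper name and the rb->mask relation are assumptions based on the ringbuf layout, where mask is the data-area size minus one):

    /* Checked before consulting consumer/producer positions: a record whose
     * rounded-up length exceeds the data area can never fit.
     */
    static bool ringbuf_reserve_size_ok(const struct bpf_ringbuf *rb, u64 size)
    {
        u64 len = round_up(size + BPF_RINGBUF_HDR_SZ, 8);

        return len <= rb->mask + 1; /* otherwise bpf_ringbuf_reserve() returns NULL */
    }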
2021-05-11  bpf: Fix alu32 const subreg bound tracking on bitwise operations  (Daniel Borkmann)

Fix a bug in the verifier's scalar32_min_max_*() functions which leads to incorrect tracking of 32 bit bounds for the simulation of and/or/xor bitops. When both the src & dst subreg is a known constant, then the assumption is that scalar_min_max_*() will take care to update bounds correctly. However, this is not the case. For example, consider a register R2 which has a tnum of 0xffffffff00000000, meaning, lower 32 bits are known constant and in this case of value 0x00000001. R2 is then and'ed with a register R3 which is a 64 bit known constant, here, 0x100000002.

What can be seen in line '10:' is that 32 bit bounds reach an invalid state where {u,s}32_min_value > {u,s}32_max_value. The reason is scalar32_min_max_*() delegates 32 bit bounds updates to scalar_min_max_*(), however, that really only takes place when both the 64 bit src & dst register is a known constant. Given scalar32_min_max_*() is intended to be designed as closely as possible to scalar_min_max_*(), update the 32 bit bounds in this situation through __mark_reg32_known() which will set all {u,s}32_{min,max}_value to the correct constant, which is 0x00000000 after the fix (given 0x00000001 & 0x00000002 in 32 bit space). This is possible given var32_off already holds the final value as dst_reg->var_off is updated before calling scalar32_min_max_*().

Before fix, invalid tracking of R2:

    [...]
    9: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=-9223372036854775807 (0x8000000000000001),smax_value=9223372032559808513 (0x7fffffff00000001),umin_value=1,umax_value=0xffffffff00000001,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_min_value=1,u32_max_value=1) R3_w=inv4294967298 R10=fp0
    9: (5f) r2 &= r3
    10: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=0,smax_value=4294967296 (0x100000000),umin_value=0,umax_value=0x100000000,var_off=(0x0; 0x100000000),s32_min_value=1,s32_max_value=0,u32_min_value=1,u32_max_value=0) R3_w=inv4294967298 R10=fp0
    [...]

After fix, correct tracking of R2:

    [...]
    9: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=-9223372036854775807 (0x8000000000000001),smax_value=9223372032559808513 (0x7fffffff00000001),umin_value=1,umax_value=0xffffffff00000001,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_min_value=1,u32_max_value=1) R3_w=inv4294967298 R10=fp0
    9: (5f) r2 &= r3
    10: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,smin_value=0,smax_value=4294967296 (0x100000000),umin_value=0,umax_value=0x100000000,var_off=(0x0; 0x100000000),s32_min_value=0,s32_max_value=0,u32_min_value=0,u32_max_value=0) R3_w=inv4294967298 R10=fp0
    [...]

Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Fixes: 2921c90d4718 ("bpf: Fix a verifier failure with xor")
Reported-by: Manfred Paul (@_manfp)
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
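A sketch of the fix in scalar32_min_max_and(), mirrored in the or/xor variants (tnum_subreg*() and __mark_reg32_known() are existing verifier helpers; treat the snippet as illustrative rather than verbatim):

    static void scalar32_min_max_and(struct bpf_reg_state *dst_reg,
                                     struct bpf_reg_state *src_reg)
    {
        bool src_known = tnum_subreg_is_const(src_reg->var_off);
        bool dst_known = tnum_subreg_is_const(dst_reg->var_off);
        struct tnum var32_off = tnum_subreg(dst_reg->var_off);

        if (src_known && dst_known) {
            /* dst_reg->var_off was already updated, so var32_off holds
             * the final 32-bit constant; don't delegate to the 64-bit
             * logic, which only acts on fully-known 64-bit constants.
             */
            __mark_reg32_known(dst_reg, var32_off.value);
            return;
        }
        /* ... handling for non-constant operands unchanged ... */
    }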
2021-05-10  bpf: verifier: Allocate idmap scratch in verifier env  (Lorenz Bauer)

func_states_equal makes a very short lived allocation for idmap, probably because it's too large to fit on the stack. However the function is called quite often, leading to a lot of alloc/free churn. Replace the temporary allocation with dedicated scratch space in struct bpf_verifier_env.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/20210429134656.122225-4-lmb@cloudflare.com
2021-05-10  bpf: verifier: Use copy_array for jmp_history  (Lorenz Bauer)

Eliminate a couple of needless kfree/kmalloc cycles by using copy_array for jmp_history.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210429134656.122225-3-lmb@cloudflare.com
2021-05-10  bpf: verifier: Improve function state reallocation  (Lorenz Bauer)

Resizing and copying stack and reference tracking state currently does a lot of kfree/kmalloc when the size of the tracked set changes. The logic in copy_*_state and realloc_*_state is also hard to follow.

Refactor this into two core functions. copy_array copies from a source into a destination. It avoids reallocation by taking the allocated size of the destination into account via ksize(). The function is essentially krealloc_array, with the difference that the contents of dst are not preserved. realloc_array changes the size of an array and zeroes newly allocated items. Contrary to krealloc, both functions don't free the destination if the size is zero. Instead we rely on free_func_state to clean up.

realloc_stack_state is renamed to grow_stack_state to better convey that it never shrinks the stack state.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210429134656.122225-2-lmb@cloudflare.com
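A sketch of copy_array() as described above (close to the patch, but treat as illustrative):

    /* copy_array: reuse dst's allocation when ksize() says it is already
     * big enough; the old contents of dst are NOT preserved.
     */
    static void *copy_array(void *dst, const void *src, size_t n, size_t size,
                            gfp_t flags)
    {
        size_t bytes;

        if (ZERO_OR_NULL_PTR(src))
            goto out;

        if (unlikely(check_mul_overflow(n, size, &bytes)))
            return NULL;

        if (ksize(dst) < bytes) {
            kfree(dst);
            dst = kmalloc_track_caller(bytes, flags);
            if (!dst)
                return NULL;
        }

        memcpy(dst, src, bytes);
    out:
        return dst ? dst : ZERO_SIZE_PTR;
    }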
2021-05-07  bpf: Forbid trampoline attach for functions with variable arguments  (Jiri Olsa)

We currently can't allow attaching to functions with variable arguments. The problem is that we should save all the registers for arguments, which is probably doable, but if the caller uses more than 6 arguments, we need stack data, which will be wrong because of the extra stack frame we add in the bpf trampoline, so we could crash. Also, malformed trampoline code is currently generated for such functions, as described in: https://lore.kernel.org/bpf/20210429212834.82621-1-jolsa@kernel.org/

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210505132529.401047-1-jolsa@kernel.org
2021-05-03  bpf: Fix leakage of uninitialized bpf stack under speculation  (Daniel Borkmann)

The currently implemented mechanisms to mitigate data disclosure under speculation mainly address stack and map value oob access from the speculative domain. However, Piotr discovered that uninitialized BPF stack is not protected yet, and thus old data from the kernel stack, potentially including addresses of kernel structures, could still be extracted from that 512 bytes large window. The BPF stack is special compared to map values since it's not zero initialized for every program invocation, whereas map values /are/ zero initialized upon their initial allocation and thus cannot leak any prior data in either domain. In the non-speculative domain, the verifier ensures that every stack slot read must have a prior stack slot write by the BPF program to avoid such data leaking issue.

However, this is not enough: for example, when the pointer arithmetic operation moves the stack pointer from the last valid stack offset to the first valid offset, the sanitation logic allows for any intermediate offsets during speculative execution, which could then be used to extract any restricted stack content via side-channel.

Given that for unprivileged stack pointer arithmetic the use of unknown but bounded scalars is generally forbidden, we can simply turn the register-based arithmetic operation into an immediate-based arithmetic operation without the need for masking. This also gives the benefit of reducing the needed instructions for the operation. After the work in 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask"), the aux->alu_limit already holds the final immediate value for the offset register with the known scalar. Thus, a simple mov of the immediate to the AX register, with AX used as the source for the original instruction, is sufficient and possible now in this case.

Reported-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-05-03  bpf: Fix masking negation logic upon negative dst register  (Daniel Borkmann)

The negation logic for the case where the off_reg is sitting in the dst register is not correct, given we cannot just invert the add to a sub or vice versa. As a fix, perform the final bitwise and-op unconditionally into AX from the off_reg, then move the pointer from the src to dst and finally use AX as the source for the original pointer arithmetic operation such that the inversion yields a correct result. The single non-AX mov in between is possible given constant blinding is retaining it as it's not an immediate based operation.

Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: Piotr Krysiuk <piotras@gmail.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-29  Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next  (Linus Torvalds)

Pull networking updates from Jakub Kicinski:

 "Core:

   - bpf:
      - allow bpf programs calling kernel functions (initially to reuse TCP congestion control implementations)
      - enable task local storage for tracing programs - remove the need to store per-task state in hash maps, and allow tracing programs access to task local storage previously added for BPF_LSM
      - add bpf_for_each_map_elem() helper, allowing programs to walk all map elements in a more robust and easier to verify fashion
      - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT redirection
      - lpm: add support for batched ops in LPM trie
      - add BTF_KIND_FLOAT support - mostly to allow use of BTF on s390 which has floats in its header files
      - improve BPF syscall documentation and extend the use of kdoc parsing scripts we already employ for bpf-helpers
      - libbpf, bpftool: support static linking of BPF ELF files
      - improve support for encapsulation of L2 packets

   - xdp: restructure redirect actions to avoid a runtime lookup, improving performance by 4-8% in microbenchmarks

   - xsk: build skb by page (aka generic zerocopy xmit) - improve performance of software AF_XDP path by 33% for devices which don't need headers in the linear skb part (e.g. virtio)

   - nexthop: resilient next-hop groups - improve path stability on next-hops group changes (incl. offload for mlxsw)

   - ipv6: segment routing: add support for IPv4 decapsulation

   - icmp: add support for RFC 8335 extended PROBE messages

   - inet: use bigger hash table for IP ID generation

   - tcp: deal better with delayed TX completions - make sure we don't give up on fast TCP retransmissions only because driver is slow in reporting that it completed transmitting the original

   - tcp: reorder tcp_congestion_ops for better cache locality

   - mptcp:
      - add sockopt support for common TCP options
      - add support for common TCP msg flags
      - include multiple address ids in RM_ADDR
      - add reset option support for resetting one subflow

   - udp: GRO L4 improvements - improve 'forward' / 'frag_list' co-existence with UDP tunnel GRO, allowing the first to take place correctly even for encapsulated UDP traffic

   - micro-optimize dev_gro_receive() and flow dissection, avoid retpoline overhead on VLAN and TEB GRO

   - use less memory for sysctls; add a new sysctl type to allow using u8 instead of "int" and "long" and shrink networking sysctls

   - veth: allow GRO without XDP - this allows aggregating UDP packets before handing them off to routing, bridge, OvS, etc.

   - allow specifying ifindex when device is moved to another namespace

   - netfilter:
      - nft_socket: add support for cgroupsv2
      - nftables: add catch-all set element - special element used to define a default action in case normal lookup missed
      - use net_generic infra in many modules to avoid allocating per-ns memory unnecessarily

   - xps: improve the xps handling to avoid potential out-of-bounds accesses and use-after-free when XPS changes race with other re-configuration under traffic

   - add a config knob to turn off per-cpu netdev refcnt to catch underflows in testing

  Device APIs:

   - add WWAN subsystem to organize the WWAN interfaces better and hopefully start driving towards more unified and vendor-independent APIs

   - ethtool:
      - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt support)
      - allow network drivers to dump arbitrary SFP EEPROM data; the current offset+length API was a poor fit for modern SFP which defines EEPROM in terms of pages (incl. mlx5 support)

   - act_police, flow_offload: add support for packet-per-second policing (incl. offload for nfp)

   - psample: add additional metadata attributes like transit delay for packets sampled from switch HW (and corresponding egress and policy-based sampling in the mlxsw driver)

   - dsa: improve support for sandwiched LAGs with bridge and DSA

   - netfilter:
      - flowtable: use direct xmit in topologies with IP forwarding, bridging, vlans etc.
      - nftables: counter hardware offload support

   - Bluetooth:
      - improvements for firmware download w/ Intel devices
      - add support for reading AOSP vendor capabilities
      - add support for virtio transport driver

   - mac80211:
      - allow concurrent monitor iface and ethernet rx decap
      - set priority and queue mapping for injected frames

   - phy: add support for Clause-45 PHY Loopback

   - pci/iov: add sysfs MSI-X vector assignment interface to distribute MSI-X resources to VFs (incl. mlx5 support)

  New hardware/drivers:

   - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit interfaces

   - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and BCM63xx switches

   - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches

   - ath11k: support for QCN9074, an 802.11ax device

   - Bluetooth: Broadcom BCM4330 and BMC4334

   - phy: Marvell 88X2222 transceiver support

   - mdio: add BCM6368 MDIO mux bus controller

   - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips

   - mana: driver for Microsoft Azure Network Adapter (MANA)

   - Actions Semi Owl Ethernet MAC

   - can: driver for ETAS ES58X CAN/USB interfaces

  Pure driver changes:

   - add XDP support to: enetc, igc, stmmac

   - add AF_XDP support to: stmmac

   - virtio:
      - page_to_skb(): use build_skb when there's sufficient tailroom (21% improvement for 1000B UDP frames)
      - support XDP even without dedicated Tx queues - share the Tx queues with the stack when necessary

   - mlx5:
      - flow rules: add support for mirroring with conntrack, matching on ICMP, GTP, flex filters and more
      - support packet sampling with flow offloads
      - persist uplink representor netdev across eswitch mode changes
      - allow coexistence of CQE compression and HW time-stamping
      - add ethtool extended link error state reporting

   - ice, iavf: support flow filters, UDP Segmentation Offload

   - dpaa2-switch:
      - move the driver out of staging
      - add spanning tree (STP) support
      - add rx copybreak support
      - add tc flower hardware offload on ingress traffic

   - ionic:
      - implement Rx page reuse
      - support HW PTP time-stamping

   - octeon: support TC hardware offloads - flower matching on ingress and egress rate limiting

   - stmmac:
      - add RX frame steering based on VLAN priority in tc flower
      - support frame preemption (FPE)

   - intel: add cross time-stamping freq difference adjustment

   - ocelot:
      - support forwarding of MRP frames in HW
      - support multiple bridges
      - support PTP Sync one-step timestamping

   - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like learning, flooding etc.

   - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350, SC7280 SoCs)

   - mt7601u: enable TDLS support

   - mt76:
      - add support for 802.3 rx frames (mt7915/mt7615)
      - mt7915 flash pre-calibration support
      - mt7921/mt7663 runtime power management fixes"

* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
  net: selftest: fix build issue if INET is disabled
  net: netrom: nr_in: Remove redundant assignment to ns
  net: tun: Remove redundant assignment to ret
  net: phy: marvell: add downshift support for M88E1240
  net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
  net/sched: act_ct: Remove redundant ct get and check
  icmp: standardize naming of RFC 8335 PROBE constants
  bpf, selftests: Update array map tests for per-cpu batched ops
  bpf: Add batched ops support for percpu array
  bpf: Implement formatted output helpers with bstr_printf
  seq_file: Add a seq_bprintf function
  sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
  net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
  net: fix a concurrency bug in l2tp_tunnel_register()
  net/smc: Remove redundant assignment to rc
  mpls: Remove redundant assignment to err
  llc2: Remove redundant assignment to rc
  net/tls: Remove redundant initialization of record
  rds: Remove redundant assignment to nr_sig
  dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
  ...
2021-04-28  bpf: Add batched ops support for percpu array  (Pedro Tammela)

Uses the already in-place infrastructure provided by the 'generic_map_*_batch' functions. No tweak was needed as it transparently handles the percpu variant. As arrays don't have delete operations, let it return an error to user space (default behaviour).

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210424214510.806627-2-pctammela@mojatatu.com
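From user space this is reachable via the existing libbpf batch API; a hedged sketch (map_fd, sizing and buffer names are assumptions; for percpu maps the values buffer holds one entry per possible CPU per key, each padded to 8 bytes):

    #include <bpf/bpf.h>

    /* Read all entries of a BPF_MAP_TYPE_PERCPU_ARRAY in one syscall. */
    static int dump_percpu_array(int map_fd, __u32 max_entries,
                                 __u32 *keys, void *pcpu_values)
    {
        __u32 out_batch, count = max_entries;

        return bpf_map_lookup_batch(map_fd, NULL /* in_batch: start */,
                                    &out_batch, keys, pcpu_values,
                                    &count, NULL /* opts */);
    }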
2021-04-27  bpf: Implement formatted output helpers with bstr_printf  (Florent Revest)

BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf and bpf_snprintf. Their signatures specify that all arguments are provided from the BPF world as u64s (in an array or as registers). All of these helpers are currently implemented by calling functions such as snprintf() whose signatures take a variable number of arguments, then placed in a va_list by the compiler to call vsnprintf().

"d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced a bpf_printf_prepare function that fills an array of u64 sanitized arguments with an array of "modifiers" which indicate what the "real" size of each argument should be (given by the format specifier). The BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to its real size. However, the C promotion rules implicitly cast them all back to u64s. Therefore, the arguments given to snprintf are u64s and the va_list constructed by the compiler will use 64 bits for each argument. On 64 bit machines, this happens to work well because 32 bit arguments in va_lists need to occupy 64 bits anyway, but on 32 bit architectures this breaks the layout of the va_list expected by the called function and mangles values.

In "88a5c690b6 bpf: fix bpf_trace_printk on 32 bit archs", this problem had been solved for bpf_trace_printk only with a "horrid workaround" that emitted multiple calls to trace_printk where each call had different argument types and generated different va_list layouts. One of the calls would be dynamically chosen at runtime. This was ok with the 3 arguments that bpf_trace_printk takes, but bpf_seq_printf and bpf_snprintf accept up to 12 arguments. Because this approach scales code exponentially, it is not a viable option anymore.

Because the promotion rules are part of the language and because the construction of a va_list is an arch-specific ABI, it's best to just avoid variadic arguments and va_lists altogether. Thankfully the kernel's snprintf() has an alternative in the form of bstr_printf() that accepts arguments in a "binary buffer representation". These binary buffers are currently created by vbin_printf and used in the tracing subsystem to split the cost of printing into two parts: a fast one that only dereferences and remembers values, and a slower one, called later, that does the pretty-printing.

This patch refactors bpf_printf_prepare to construct binary buffers of arguments consumable by bstr_printf() instead of arrays of arguments and modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the bpf_printf_prepare usage, but there are a few gotchas that change how bpf_printf_prepare needs to do things. Currently, bpf_printf_prepare uses a per cpu temporary buffer as generic storage for strings and IP addresses. With this refactoring, the temporary buffer now holds all the arguments in a structured binary format. To comply with the format expected by bstr_printf, certain format specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6. Because the vsnprintf subroutines for these specifiers are hard to expose, we pre-format these arguments with calls to snprintf().

Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
2021-04-27Merge tag 'selinux-pr-20210426' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:

 - Add support for measuring the SELinux state and policy capabilities using IMA.

 - A handful of SELinux/NFS patches to compare the SELinux state of one mount with a set of mount options. Olga goes into more detail in the patch descriptions, but this is important as it allows more flexibility when using NFS and SELinux context mounts.

 - Properly differentiate between the subjective and objective LSM credentials, including support for SELinux and Smack. My clumsy attempt at a proper fix for AppArmor didn't quite pass muster, so John is working on a proper AppArmor patch; in the meantime this set of patches shouldn't change the behavior of AppArmor in any way. This change explains the bulk of the diffstat beyond security/.

 - Fix a problem where we were not properly terminating the permission list for two SELinux object classes.

* tag 'selinux-pr-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: add proper NULL termination to the secclass_map permissions
  smack: differentiate between subjective and objective task credentials
  selinux: clarify task subjective and objective credentials
  lsm: separate security_task_getsecid() into subjective and objective variants
  nfs: account for selinux security context when deciding to share superblock
  nfs: remove unneeded null check in nfs_fill_super()
  lsm,selinux: add new hook to compare new mount to an existing mount
  selinux: fix misspellings using codespell tool
  selinux: fix misspellings using codespell tool
  selinux: measure state and policy capabilities
  selinux: Allow context mounts for unpriviliged overlayfs
2021-04-27bpf, cpumap: Bulk skb using netif_receive_skb_listLorenzo Bianconi
Rely on the netif_receive_skb_list routine to send skbs converted from xdp_frames in cpu_map_kthread_run in order to improve i-cache usage. The proposed patch has been tested running the xdp_redirect_cpu bpf sample available in the kernel tree, which is used to redirect UDP frames from the ixgbe driver to a cpumap entry and then to the networking stack. UDP frames are generated using pktgen and the packets are discarded by the UDP layer.

$ xdp_redirect_cpu --cpu <cpu> --progname xdp_cpu_map0 --dev <eth>

bpf-next: ~2.35Mpps
bpf-next + cpumap skb-list: ~2.72Mpps

Rename the drops counter to kmem_alloc_drops since it now reports just kmem_cache_alloc_bulk failures.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/c729f83e5d7482d9329e0f165bdbe5adcefd1510.1619169700.git.lorenzo@kernel.org
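For illustration, a minimal sketch of the bulking pattern in cpu_map_kthread_run; build_skb_from_frame() is a stand-in for the cpumap xdp_frame-to-skb conversion (the kernel's actual helper is __xdp_build_skb_from_frame), and the batch size is arbitrary:

  #include <linux/netdevice.h>
  #include <linux/skbuff.h>
  #include <net/xdp.h>

  /* Queue every skb built from the current batch of xdp_frames on a
   * local list, then hand the whole list to the stack at once so the
   * receive path stays hot in the i-cache across the batch. */
  static void deliver_batch(struct xdp_frame **frames, int n)
  {
          LIST_HEAD(list);
          int i;

          for (i = 0; i < n; i++) {
                  struct sk_buff *skb = build_skb_from_frame(frames[i]);

                  if (skb)
                          list_add_tail(&skb->list, &list);
          }
          netif_receive_skb_list(&list);  /* one call for the batch */
  }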
2021-04-27bpf: Fix propagation of 32 bit unsigned bounds from 64 bit boundsDaniel Borkmann
Similarly to b02709587ea3 ("bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds."), we also need to fix the propagation of 32 bit unsigned bounds from their 64 bit counterparts. That is, really only set the u32_{min,max}_value when /both/ {umin,umax}_value safely fit in 32 bit space. For example, a register with umin_value == 1 does /not/ imply that u32_min_value is also equal to 1, since umax_value could be much larger than the 32 bit subregister can hold, and thus u32_min_value is in the interval [0,1] instead.

Before the fix, invalid tracking result of R2_w=inv1:

[...]
5: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0) R10=fp0
5: (35) if r2 >= 0x1 goto pc+1
[...] // goto path
7: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umin_value=1) R10=fp0
7: (b6) if w2 <= 0x1 goto pc+1
[...] // goto path
9: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,smin_value=-9223372036854775807,smax_value=9223372032559808513,umin_value=1,umax_value=18446744069414584321,var_off=(0x1; 0xffffffff00000000),s32_min_value=1,s32_max_value=1,u32_max_value=1) R10=fp0
9: (bc) w2 = w2
10: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv1 R10=fp0
[...]

After the fix, correct tracking result of R2_w=inv(id=0,umax_value=1,var_off=(0x0; 0x1)):

[...]
5: R0_w=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0) R10=fp0
5: (35) if r2 >= 0x1 goto pc+1
[...] // goto path
7: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,umin_value=1) R10=fp0
7: (b6) if w2 <= 0x1 goto pc+1
[...] // goto path
9: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2=inv(id=0,smax_value=9223372032559808513,umax_value=18446744069414584321,var_off=(0x0; 0xffffffff00000001),s32_min_value=0,s32_max_value=1,u32_max_value=1) R10=fp0
9: (bc) w2 = w2
10: R0=inv1337 R1=ctx(id=0,off=0,imm=0) R2_w=inv(id=0,umax_value=1,var_off=(0x0; 0x1)) R10=fp0
[...]

Thus, the same issue as in b02709587ea3 holds for unsigned subregister tracking. Also, align __reg64_bound_u32() with __reg64_bound_s32() as done in b02709587ea3 to make them uniform again.

Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Manfred Paul (@_manfp)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
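A condensed sketch of the rule the fix enforces (names are illustrative; the kernel's actual helpers are __reg64_bound_u32() and __reg_combine_64_into_32()):

  /* Only seed 32 bit unsigned bounds from the 64 bit ones when both
   * ends of the 64 bit interval fit into 32 bits; otherwise the low
   * subregister could wrap and must stay unbounded. */
  static bool bound_fits_u32(u64 a)
  {
          return a <= U32_MAX;
  }

  static void combine_u64_into_u32(struct bpf_reg_state *reg)
  {
          if (bound_fits_u32(reg->umin_value) &&
              bound_fits_u32(reg->umax_value)) {
                  reg->u32_min_value = (u32)reg->umin_value;
                  reg->u32_max_value = (u32)reg->umax_value;
          }
          /* e.g. umin_value == 1, umax_value == 2^32 + 1: the low 32
           * bits can still be anything in [0, U32_MAX], so no 32 bit
           * bound may be derived */
  }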
2021-04-25bpf: Allow trampoline re-attach for tracing and lsm programsJiri Olsa
Currently we don't allow re-attaching of trampolines. Once a trampoline is detached, it can't be re-attached even when the program is still loaded. Add the possibility to re-attach loaded tracing and lsm programs. Also fix a missing unlock, with a proper cleanup goto jump, reported by Julia.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210414195147.1624932-2-jolsa@kernel.org
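From user space, the sequence this enables looks roughly like the sketch below (libbpf calls; error handling is simplified, and the NULL-on-failure convention is an assumption since the exact return convention varies across libbpf versions):

  #include <bpf/libbpf.h>

  /* Detach a loaded fentry/LSM program and attach it again without
   * reloading the object -- previously the second attach would fail. */
  int detach_then_reattach(struct bpf_program *prog)
  {
          struct bpf_link *link = bpf_program__attach_trace(prog);

          if (!link)
                  return -1;
          bpf_link__destroy(link);        /* tears down the attachment */

          link = bpf_program__attach_trace(prog); /* re-attach, now OK */
          if (!link)
                  return -1;
          bpf_link__destroy(link);
          return 0;
  }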
2021-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-23bpf: Remove unnecessary map checks for ARG_PTR_TO_CONST_STRFlorent Revest
reg->type is enforced by check_reg_type(), and map should never be NULL (it would already have been dereferenced anyway), so these checks are unnecessary.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210422235543.4007694-3-revest@chromium.org
2021-04-23bpf: Notify user if we ever hit a bpf_snprintf verifier bugFlorent Revest
In check_bpf_snprintf_call(), a map_direct_value_addr() call on the fmt map should never fail because it has already been checked by ARG_PTR_TO_CONST_STR. But if it ever fails, it's better to error out with an explicit debug message rather than fail silently.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210422235543.4007694-2-revest@chromium.org
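The resulting pattern in the verifier looks roughly like this (condensed sketch of the idea, not the exact diff):

  err = map_direct_value_addr(fmt_map, &fmt_addr, fmt_map_off);
  if (err) {
          /* ARG_PTR_TO_CONST_STR already validated the fmt map, so
           * reaching this branch means a verifier bug -- say so loudly
           * instead of failing silently. */
          verbose(env, "verifier bug\n");
          return -EFAULT;
  }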
2021-04-19bpf: Refine retval for bpf_get_task_stack helperDave Marchevsky
The verifier can constrain the min/max bounds of bpf_get_task_stack's return value more tightly than the default tnum_unknown. Like bpf_get_stack, the return value is the number of bytes written into a caller-supplied buffer, or an error, so do_refine_retval_range will work.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210416204704.2816874-2-davemarchevsky@fb.com
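Conceptually, the change teaches do_refine_retval_range() to treat the new helper like bpf_get_stack; a condensed sketch (the full list of covered func_ids is abbreviated here):

  /* R0 of these helpers is at most the size of the buffer argument
   * recorded in meta, never more, so tighten the upper bound. */
  if (ret_type != RET_INTEGER ||
      (func_id != BPF_FUNC_get_stack &&
       func_id != BPF_FUNC_get_task_stack))    /* newly covered */
          return;

  ret_reg->smax_value = meta->msize_max_value;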
2021-04-19bpf: Add a bpf_snprintf helperFlorent Revest
The implementation takes inspiration from the existing bpf_trace_printk helper but there are a few differences: to allow for a large number of format specifiers, parameters are provided in an array, like in bpf_seq_printf. Because the output string takes two arguments and the array of parameters also takes two arguments, the format string needs to fit in one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to a zero-terminated string in a read-only map, so we don't need a format string length argument. Because the format string is known at verification time, we also do a first pass of format string validation in the verifier logic. This makes debugging easier.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-4-revest@chromium.org
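A minimal usage sketch from the BPF side, assuming a recent libbpf for the helper declaration (the tracepoint, buffer sizes and field choices are arbitrary; the const fmt string lands in .rodata, which satisfies ARG_PTR_TO_CONST_STR):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  static const char fmt[] = "comm: %s pid: %d\n";

  SEC("tracepoint/syscalls/sys_enter_execve")
  int log_exec(void *ctx)
  {
          char out[64], comm[16];
          __u64 args[2];

          bpf_get_current_comm(comm, sizeof(comm));
          args[0] = (__u64)(long)comm;                 /* %s */
          args[1] = bpf_get_current_pid_tgid() >> 32;  /* %d */
          /* data_len is the byte size of the args array */
          bpf_snprintf(out, sizeof(out), fmt, args, sizeof(args));
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";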
2021-04-19bpf: Add a ARG_PTR_TO_CONST_STR argument typeFlorent Revest
This type provides the guarantee that an argument is going to be a const pointer to somewhere in a read-only map value. It also checks that this pointer is followed by a zero character before the end of the map value.

Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210419155243.1632274-3-revest@chromium.org
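Condensed, the verifier-side check behaves like this sketch (close to the logic in check_func_arg(); the surrounding offset computation is omitted):

  /* The argument register must carry a value pointer into a frozen,
   * read-only map ... */
  if (!bpf_map_is_rdonly(map)) {
          verbose(env, "R%d does not point to a readonly map\n", regno);
          return -EACCES;
  }
  /* ... and a NUL byte must occur before the end of the map value,
   * so helpers can treat the argument as a bounded C string. */
  str_ptr = (char *)(long)map_addr;
  if (!strnchr(str_ptr + map_off, map->value_size - map_off, 0)) {
          verbose(env, "string is not zero-terminated\n");
          return -EINVAL;
  }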