Age | Commit message (Collapse) | Author |
|
Update BPF selftests to use the new RSS type argument for kfunc
bpf_xdp_metadata_rx_hash.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132894068.340624.8914711185697163690.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The tool xdp_hw_metadata can be used by driver developers
implementing XDP-hints metadata kfuncs.
Remove all bpf_printk calls, as the tool already transfers all the
XDP-hints related information via metadata area to AF_XDP
userspace process.
Add counters for providing remaining information about failure and
skipped packet events.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/168132891533.340624.7313781245316405141.stgit@firesoul
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
xdp-features supported by veth driver are no more static, but they
depends on veth configuration (e.g. if GRO is enabled/disabled or
TX/RX queue configuration). Take it into account in xdp_redirect
xdp-features selftest for veth driver.
Fixes: fccca038f300 ("veth: take into account device reconfiguration for xdp_features flag")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/bc35455cfbb1d4f7f52536955ded81ad47d8dc54.1680777371.git.lorenzo@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Merge commit bf9bec4cb3a4 ("Merge branch 'bpf: Allow reads from uninit stack'")
from bpf-next to bpf tree to address verification issues in some programs
due to stack usage.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The following build error can be seen:
progs/test_deny_namespace.c:22:19: error: call to undeclared function 'BIT_LL'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
__u64 cap_mask = BIT_LL(CAP_SYS_ADMIN);
The struct kernel_cap_struct no longer exists in the kernel as well.
Adjust bpf prog to fix both issues.
Fixes: f122a08b197d ("capability: just use a 'u64' instead of a 'u32[2]' array")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The commit 11e456cae91e ("selftests/bpf: Fix compilation errors: Assign a value to a constant")
fixed the issue cleanly in bpf-next.
This is an alternative fix in bpf tree to avoid merge conflict between bpf and bpf-next.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from netfilter and bpf.
Current release - regressions:
- core: avoid skb end_offset change in __skb_unclone_keeptruesize()
- sched:
- act_connmark: handle errno on tcf_idr_check_alloc
- flower: fix fl_change() error recovery path
- ieee802154: prevent user from crashing the host
Current release - new code bugs:
- eth: bnxt_en: fix the double free during device removal
- tools: ynl:
- fix enum-as-flags in the generic CLI
- fully inherit attrs in subsets
- re-license uniformly under GPL-2.0 or BSD-3-clause
Previous releases - regressions:
- core: use indirect calls helpers for sk_exit_memory_pressure()
- tls:
- fix return value for async crypto
- avoid hanging tasks on the tx_lock
- eth: ice: copy last block omitted in ice_get_module_eeprom()
Previous releases - always broken:
- core: avoid double iput when sock_alloc_file fails
- af_unix: fix struct pid leaks in OOB support
- tls:
- fix possible race condition
- fix device-offloaded sendpage straddling records
- bpf:
- sockmap: fix an infinite loop error
- test_run: fix &xdp_frame misplacement for LIVE_FRAMES
- fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR
- netfilter: tproxy: fix deadlock due to missing BH disable
- phylib: get rid of unnecessary locking
- eth: bgmac: fix *initial* chip reset to support BCM5358
- eth: nfp: fix csum for ipsec offload
- eth: mtk_eth_soc: fix RX data corruption issue
Misc:
- usb: qmi_wwan: add telit 0x1080 composition"
* tag 'net-6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
tools: ynl: fix enum-as-flags in the generic CLI
tools: ynl: move the enum classes to shared code
net: avoid double iput when sock_alloc_file fails
af_unix: fix struct pid leaks in OOB support
eth: fealnx: bring back this old driver
net: dsa: mt7530: permit port 5 to work without port 6 on MT7621 SoC
net: microchip: sparx5: fix deletion of existing DSCP mappings
octeontx2-af: Unlock contexts in the queue context cache in case of fault detection
net/smc: fix fallback failed while sendmsg with fastopen
ynl: re-license uniformly under GPL-2.0 OR BSD-3-Clause
mailmap: update entries for Stephen Hemminger
mailmap: add entry for Maxim Mikityanskiy
nfc: change order inside nfc_se_io error path
ethernet: ice: avoid gcc-9 integer overflow warning
ice: don't ignore return codes in VSI related code
ice: Fix DSCP PFC TLV creation
net: usb: qmi_wwan: add Telit 0x1080 composition
net: usb: cdc_mbim: avoid altsetting toggling for Telit FE990
netfilter: conntrack: adopt safer max chain length
net: tls: fix device-offloaded sendpage straddling records
...
|
|
Add a regression test that ensures that a VAR pointing at a
modifier which follows a PTR (or STRUCT or ARRAY) is resolved
correctly by the datasec validator.
Signed-off-by: Lorenz Bauer <lmb@isovalent.com>
Link: https://lore.kernel.org/r/20230306112138.155352-3-lmb@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
&xdp_buff and &xdp_frame are bound in a way that
xdp_buff->data_hard_start == xdp_frame
It's always the case and e.g. xdp_convert_buff_to_frame() relies on
this.
IOW, the following:
for (u32 i = 0; i < 0xdead; i++) {
xdpf = xdp_convert_buff_to_frame(&xdp);
xdp_convert_frame_to_buff(xdpf, &xdp);
}
shouldn't ever modify @xdpf's contents or the pointer itself.
However, "live packet" code wrongly treats &xdp_frame as part of its
context placed *before* the data_hard_start. With such flow,
data_hard_start is sizeof(*xdpf) off to the right and no longer points
to the XDP frame.
Instead of replacing `sizeof(ctx)` with `offsetof(ctx, xdpf)` in several
places and praying that there are no more miscalcs left somewhere in the
code, unionize ::frm with ::data in a flex array, so that both starts
pointing to the actual data_hard_start and the XDP frame actually starts
being a part of it, i.e. a part of the headroom, not the context.
A nice side effect is that the maximum frame size for this mode gets
increased by 40 bytes, as xdp_buff::frame_sz includes everything from
data_hard_start (-> includes xdpf already) to the end of XDP/skb shared
info.
Also update %MAX_PKT_SIZE accordingly in the selftests code. Leave it
hardcoded for 64 bit && 4k pages, it can be made more flexible later on.
Minor: align `&head->data` with how `head->frm` is assigned for
consistency.
Minor #2: rename 'frm' to 'frame' in &xdp_page_head while at it for
clarity.
(was found while testing XDP traffic generator on ice, which calls
xdp_convert_frame_to_buff() for each XDP frame)
Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN")
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230224163607.2994755-1-aleksander.lobakin@intel.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Back in 2008 we extended the capability bits from 32 to 64, and we did
it by extending the single 32-bit capability word from one word to an
array of two words. It was then obfuscated by hiding the "2" behind two
macro expansions, with the reasoning being that maybe it gets extended
further some day.
That reasoning may have been valid at the time, but the last thing we
want to do is to extend the capability set any more. And the array of
values not only causes source code oddities (with loops to deal with
it), but also results in worse code generation. It's a lose-lose
situation.
So just change the 'u32[2]' into a 'u64' and be done with it.
We still have to deal with the fact that the user space interface is
designed around an array of these 32-bit values, but that was the case
before too, since the array layouts were different (ie user space
doesn't use an array of 32-bit values for individual capability masks,
but an array of 32-bit slices of multiple masks).
So that marshalling of data is actually simplified too, even if it does
remain somewhat obscure and odd.
This was all triggered by my reaction to the new "cap_isidentical()"
introduced recently. By just using a saner data structure, it went from
unsigned __capi;
CAP_FOR_EACH_U32(__capi) {
if (a.cap[__capi] != b.cap[__capi])
return false;
}
return true;
to just being
return a.val == b.val;
instead. Which is rather more obvious both to humans and to compilers.
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Paul Moore <paul@paul-moore.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Three testcases to make sure that stack reads from uninitialized
locations are accepted by verifier when executed in privileged mode:
- read from a fixed offset;
- read from a variable offset;
- passing a pointer to stack to a helper converts
STACK_INVALID to STACK_MISC.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230219200427.606541-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This commits updates the following functions to allow reads from
uninitialized stack locations when env->allow_uninit_stack option is
enabled:
- check_stack_read_fixed_off()
- check_stack_range_initialized(), called from:
- check_stack_read_var_off()
- check_helper_mem_access()
Such change allows to relax logic in stacksafe() to treat STACK_MISC
and STACK_INVALID in a same way and make the following stack slot
configurations equivalent:
| Cached state | Current state |
| stack slot | stack slot |
|------------------+------------------|
| STACK_INVALID or | STACK_INVALID or |
| STACK_MISC | STACK_SPILL or |
| | STACK_MISC or |
| | STACK_ZERO or |
| | STACK_DYNPTR |
This leads to significant verification speed gains (see below).
The idea was suggested by Andrii Nakryiko [1] and initial patch was
created by Alexei Starovoitov [2].
Currently the env->allow_uninit_stack is allowed for programs loaded
by users with CAP_PERFMON or CAP_SYS_ADMIN capabilities.
A number of test cases from verifier/*.c were expecting uninitialized
stack access to be an error. These test cases were updated to execute
in unprivileged mode (thus preserving the tests).
The test progs/test_global_func10.c expected "invalid indirect read
from stack" error message because of the access to uninitialized
memory region. This error is no longer possible in privileged mode.
The test is updated to provoke an error "invalid indirect access to
stack" because of access to invalid stack address (such error is not
verified by progs/test_global_func*.c series of tests).
The following tests had to be removed because these can't be made
unprivileged:
- verifier/sock.c:
- "sk_storage_get(map, skb->sk, &stack_value, 1): partially init
stack_value"
BPF_PROG_TYPE_SCHED_CLS programs are not executed in unprivileged mode.
- verifier/var_off.c:
- "indirect variable-offset stack access, max_off+size > max_initialized"
- "indirect variable-offset stack access, uninitialized"
These tests verify that access to uninitialized stack values is
detected when stack offset is not a constant. However, variable
stack access is prohibited in unprivileged mode, thus these tests
are no longer valid.
* * *
Here is veristat log comparing this patch with current master on a
set of selftest binaries listed in tools/testing/selftests/bpf/veristat.cfg
and cilium BPF binaries (see [3]):
$ ./veristat -e file,prog,states -C -f 'states_pct<-30' master.log current.log
File Program States (A) States (B) States (DIFF)
-------------------------- -------------------------- ---------- ---------- ----------------
bpf_host.o tail_handle_ipv6_from_host 349 244 -105 (-30.09%)
bpf_host.o tail_handle_nat_fwd_ipv4 1320 895 -425 (-32.20%)
bpf_lxc.o tail_handle_nat_fwd_ipv4 1320 895 -425 (-32.20%)
bpf_sock.o cil_sock4_connect 70 48 -22 (-31.43%)
bpf_sock.o cil_sock4_sendmsg 68 46 -22 (-32.35%)
bpf_xdp.o tail_handle_nat_fwd_ipv4 1554 803 -751 (-48.33%)
bpf_xdp.o tail_lb_ipv4 6457 2473 -3984 (-61.70%)
bpf_xdp.o tail_lb_ipv6 7249 3908 -3341 (-46.09%)
pyperf600_bpf_loop.bpf.o on_event 287 145 -142 (-49.48%)
strobemeta.bpf.o on_event 15915 4772 -11143 (-70.02%)
strobemeta_nounroll2.bpf.o on_event 17087 3820 -13267 (-77.64%)
xdp_synproxy_kern.bpf.o syncookie_tc 21271 6635 -14636 (-68.81%)
xdp_synproxy_kern.bpf.o syncookie_xdp 23122 6024 -17098 (-73.95%)
-------------------------- -------------------------- ---------- ---------- ----------------
Note: I limited selection by states_pct<-30%.
Inspection of differences in pyperf600_bpf_loop behavior shows that
the following patch for the test removes almost all differences:
- a/tools/testing/selftests/bpf/progs/pyperf.h
+ b/tools/testing/selftests/bpf/progs/pyperf.h
@ -266,8 +266,8 @ int __on_event(struct bpf_raw_tracepoint_args *ctx)
}
if (event->pthread_match || !pidData->use_tls) {
- void* frame_ptr;
- FrameData frame;
+ void* frame_ptr = 0;
+ FrameData frame = {};
Symbol sym = {};
int cur_cpu = bpf_get_smp_processor_id();
W/o this patch the difference comes from the following pattern
(for different variables):
static bool get_frame_data(... FrameData *frame ...)
{
...
bpf_probe_read_user(&frame->f_code, ...);
if (!frame->f_code)
return false;
...
bpf_probe_read_user(&frame->co_name, ...);
if (frame->co_name)
...;
}
int __on_event(struct bpf_raw_tracepoint_args *ctx)
{
FrameData frame;
...
get_frame_data(... &frame ...) // indirectly via a bpf_loop & callback
...
}
SEC("raw_tracepoint/kfree_skb")
int on_event(struct bpf_raw_tracepoint_args* ctx)
{
...
ret |= __on_event(ctx);
ret |= __on_event(ctx);
...
}
With regards to value `frame->co_name` the following is important:
- Because of the conditional `if (!frame->f_code)` each call to
__on_event() produces two states, one with `frame->co_name` marked
as STACK_MISC, another with it as is (and marked STACK_INVALID on a
first call).
- The call to bpf_probe_read_user() does not mark stack slots
corresponding to `&frame->co_name` as REG_LIVE_WRITTEN but it marks
these slots as BPF_MISC, this happens because of the following loop
in the check_helper_call():
for (i = 0; i < meta.access_size; i++) {
err = check_mem_access(env, insn_idx, meta.regno, i, BPF_B,
BPF_WRITE, -1, false);
if (err)
return err;
}
Note the size of the write, it is a one byte write for each byte
touched by a helper. The BPF_B write does not lead to write marks
for the target stack slot.
- Which means that w/o this patch when second __on_event() call is
verified `if (frame->co_name)` will propagate read marks first to a
stack slot with STACK_MISC marks and second to a stack slot with
STACK_INVALID marks and these states would be considered different.
[1] https://lore.kernel.org/bpf/CAEf4BzY3e+ZuC6HUa8dCiUovQRg2SzEk7M-dSkqNZyn=xEmnPA@mail.gmail.com/
[2] https://lore.kernel.org/bpf/CAADnVQKs2i1iuZ5SUGuJtxWVfGYR9kDgYKhq3rNV+kBLQCu7rA@mail.gmail.com/
[3] git@github.com:anakryiko/cilium.git
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Co-developed-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230219200427.606541-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- Add dedicated kmem_cache for typical/small skb->head, avoid having
to access struct page at kfree time, and improve memory use.
- Introduce sysctl to set default RPS configuration for new netdevs.
- Define Netlink protocol specification format which can be used to
describe messages used by each family and auto-generate parsers.
Add tools for generating kernel data structures and uAPI headers.
- Expose all net/core sysctls inside netns.
- Remove 4s sleep in netpoll if carrier is instantly detected on
boot.
- Add configurable limit of MDB entries per port, and port-vlan.
- Continue populating drop reasons throughout the stack.
- Retire a handful of legacy Qdiscs and classifiers.
Protocols:
- Support IPv4 big TCP (TSO frames larger than 64kB).
- Add IP_LOCAL_PORT_RANGE socket option, to control local port range
on socket by socket basis.
- Track and report in procfs number of MPTCP sockets used.
- Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path
manager.
- IPv6: don't check net.ipv6.route.max_size and rely on garbage
collection to free memory (similarly to IPv4).
- Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986).
- ICMP: add per-rate limit counters.
- Add support for user scanning requests in ieee802154.
- Remove static WEP support.
- Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate
reporting.
- WiFi 7 EHT channel puncturing support (client & AP).
BPF:
- Add a rbtree data structure following the "next-gen data structure"
precedent set by recently added linked list, that is, by using
kfunc + kptr instead of adding a new BPF map type.
- Expose XDP hints via kfuncs with initial support for RX hash and
timestamp metadata.
- Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to
better support decap on GRE tunnel devices not operating in collect
metadata.
- Improve x86 JIT's codegen for PROBE_MEM runtime error checks.
- Remove the need for trace_printk_lock for bpf_trace_printk and
bpf_trace_vprintk helpers.
- Extend libbpf's bpf_tracing.h support for tracing arguments of
kprobes/uprobes and syscall as a special case.
- Significantly reduce the search time for module symbols by
livepatch and BPF.
- Enable cpumasks to be used as kptrs, which is useful for tracing
programs tracking which tasks end up running on which CPUs in
different time intervals.
- Add support for BPF trampoline on s390x and riscv64.
- Add capability to export the XDP features supported by the NIC.
- Add __bpf_kfunc tag for marking kernel functions as kfuncs.
- Add cgroup.memory=nobpf kernel parameter option to disable BPF
memory accounting for container environments.
Netfilter:
- Remove the CLUSTERIP target. It has been marked as obsolete for
years, and we still have WARN splats wrt races of the out-of-band
/proc interface installed by this target.
- Add 'destroy' commands to nf_tables. They are identical to the
existing 'delete' commands, but do not return an error if the
referenced object (set, chain, rule...) did not exist.
Driver API:
- Improve cpumask_local_spread() locality to help NICs set the right
IRQ affinity on AMD platforms.
- Separate C22 and C45 MDIO bus transactions more clearly.
- Introduce new DCB table to control DSCP rewrite on egress.
- Support configuration of Physical Layer Collision Avoidance (PLCA)
Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of
shared medium Ethernet.
- Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing
preemption of low priority frames by high priority frames.
- Add support for controlling MACSec offload using netlink SET.
- Rework devlink instance refcounts to allow registration and
de-registration under the instance lock. Split the code into
multiple files, drop some of the unnecessarily granular locks and
factor out common parts of netlink operation handling.
- Add TX frame aggregation parameters (for USB drivers).
- Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning
messages with notifications for debug.
- Allow offloading of UDP NEW connections via act_ct.
- Add support for per action HW stats in TC.
- Support hardware miss to TC action (continue processing in SW from
a specific point in the action chain).
- Warn if old Wireless Extension user space interface is used with
modern cfg80211/mac80211 drivers. Do not support Wireless
Extensions for Wi-Fi 7 devices at all. Everyone should switch to
using nl80211 interface instead.
- Improve the CAN bit timing configuration. Use extack to return
error messages directly to user space, update the SJW handling,
including the definition of a new default value that will benefit
CAN-FD controllers, by increasing their oscillator tolerance.
New hardware / drivers:
- Ethernet:
- nVidia BlueField-3 support (control traffic driver)
- Ethernet support for imx93 SoCs
- Motorcomm yt8531 gigabit Ethernet PHY
- onsemi NCN26000 10BASE-T1S PHY (with support for PLCA)
- Microchip LAN8841 PHY (incl. cable diagnostics and PTP)
- Amlogic gxl MDIO mux
- WiFi:
- RealTek RTL8188EU (rtl8xxxu)
- Qualcomm Wi-Fi 7 devices (ath12k)
- CAN:
- Renesas R-Car V4H
Drivers:
- Bluetooth:
- Set Per Platform Antenna Gain (PPAG) for Intel controllers.
- Ethernet NICs:
- Intel (1G, igc):
- support TSN / Qbv / packet scheduling features of i226 model
- Intel (100G, ice):
- use GNSS subsystem instead of TTY
- multi-buffer XDP support
- extend support for GPIO pins to E823 devices
- nVidia/Mellanox:
- update the shared buffer configuration on PFC commands
- implement PTP adjphase function for HW offset control
- TC support for Geneve and GRE with VF tunnel offload
- more efficient crypto key management method
- multi-port eswitch support
- Netronome/Corigine:
- add DCB IEEE support
- support IPsec offloading for NFP3800
- Freescale/NXP (enetc):
- support XDP_REDIRECT for XDP non-linear buffers
- improve reconfig, avoid link flap and waiting for idle
- support MAC Merge layer
- Other NICs:
- sfc/ef100: add basic devlink support for ef100
- ionic: rx_push mode operation (writing descriptors via MMIO)
- bnxt: use the auxiliary bus abstraction for RDMA
- r8169: disable ASPM and reset bus in case of tx timeout
- cpsw: support QSGMII mode for J721e CPSW9G
- cpts: support pulse-per-second output
- ngbe: add an mdio bus driver
- usbnet: optimize usbnet_bh() by avoiding unnecessary queuing
- r8152: handle devices with FW with NCM support
- amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation
- virtio-net: support multi buffer XDP
- virtio/vsock: replace virtio_vsock_pkt with sk_buff
- tsnep: XDP support
- Ethernet high-speed switches:
- nVidia/Mellanox (mlxsw):
- add support for latency TLV (in FW control messages)
- Microchip (sparx5):
- separate explicit and implicit traffic forwarding rules, make
the implicit rules always active
- add support for egress DSCP rewrite
- IS0 VCAP support (Ingress Classification)
- IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS
etc.)
- ES2 VCAP support (Egress Access Control)
- support for Per-Stream Filtering and Policing (802.1Q,
8.6.5.1)
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- add MAB (port auth) offload support
- enable PTP receive for mv88e6390
- NXP (ocelot):
- support MAC Merge layer
- support for the the vsc7512 internal copper phys
- Microchip:
- lan9303: convert to PHYLINK
- lan966x: support TC flower filter statistics
- lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x
- lan937x: support Credit Based Shaper configuration
- ksz9477: support Energy Efficient Ethernet
- other:
- qca8k: convert to regmap read/write API, use bulk operations
- rswitch: Improve TX timestamp accuracy
- Intel WiFi (iwlwifi):
- EHT (Wi-Fi 7) rate reporting
- STEP equalizer support: transfer some STEP (connection to radio
on platforms with integrated wifi) related parameters from the
BIOS to the firmware.
- Qualcomm 802.11ax WiFi (ath11k):
- IPQ5018 support
- Fine Timing Measurement (FTM) responder role support
- channel 177 support
- MediaTek WiFi (mt76):
- per-PHY LED support
- mt7996: EHT (Wi-Fi 7) support
- Wireless Ethernet Dispatch (WED) reset support
- switch to using page pool allocator
- RealTek WiFi (rtw89):
- support new version of Bluetooth co-existance
- Mobile:
- rmnet: support TX aggregation"
* tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits)
page_pool: add a comment explaining the fragment counter usage
net: ethtool: fix __ethtool_dev_mm_supported() implementation
ethtool: pse-pd: Fix double word in comments
xsk: add linux/vmalloc.h to xsk.c
sefltests: netdevsim: wait for devlink instance after netns removal
selftest: fib_tests: Always cleanup before exit
net/mlx5e: Align IPsec ASO result memory to be as required by hardware
net/mlx5e: TC, Set CT miss to the specific ct action instance
net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG
net/mlx5: Refactor tc miss handling to a single function
net/mlx5: Kconfig: Make tc offload depend on tc skb extension
net/sched: flower: Support hardware miss to tc action
net/sched: flower: Move filter handle initialization earlier
net/sched: cls_api: Support hardware miss to tc action
net/sched: Rename user cookie and act cookie
sfc: fix builds without CONFIG_RTC_LIB
sfc: clean up some inconsistent indentings
net/mlx4_en: Introduce flexible array to silence overflow warning
net: lan966x: Fix possible deadlock inside PTP
net/ulp: Remove redundant ->clone() test in inet_clone_ulp().
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull vfs idmapping updates from Christian Brauner:
- Last cycle we introduced the dedicated struct mnt_idmap type for
mount idmapping and the required infrastucture in 256c8aed2b42 ("fs:
introduce dedicated idmap type for mounts"). As promised in last
cycle's pull request message this converts everything to rely on
struct mnt_idmap.
Currently we still pass around the plain namespace that was attached
to a mount. This is in general pretty convenient but it makes it easy
to conflate namespaces that are relevant on the filesystem with
namespaces that are relevant on the mount level. Especially for
non-vfs developers without detailed knowledge in this area this was a
potential source for bugs.
This finishes the conversion. Instead of passing the plain namespace
around this updates all places that currently take a pointer to a
mnt_userns with a pointer to struct mnt_idmap.
Now that the conversion is done all helpers down to the really
low-level helpers only accept a struct mnt_idmap argument instead of
two namespace arguments.
Conflating mount and other idmappings will now cause the compiler to
complain loudly thus eliminating the possibility of any bugs. This
makes it impossible for filesystem developers to mix up mount and
filesystem idmappings as they are two distinct types and require
distinct helpers that cannot be used interchangeably.
Everything associated with struct mnt_idmap is moved into a single
separate file. With that change no code can poke around in struct
mnt_idmap. It can only be interacted with through dedicated helpers.
That means all filesystems are and all of the vfs is completely
oblivious to the actual implementation of idmappings.
We are now also able to extend struct mnt_idmap as we see fit. For
example, we can decouple it completely from namespaces for users that
don't require or don't want to use them at all. We can also extend
the concept of idmappings so we can cover filesystem specific
requirements.
In combination with the vfs{g,u}id_t work we finished in v6.2 this
makes this feature substantially more robust and thus difficult to
implement wrong by a given filesystem and also protects the vfs.
- Enable idmapped mounts for tmpfs and fulfill a longstanding request.
A long-standing request from users had been to make it possible to
create idmapped mounts for tmpfs. For example, to share the host's
tmpfs mount between multiple sandboxes. This is a prerequisite for
some advanced Kubernetes cases. Systemd also has a range of use-cases
to increase service isolation. And there are more users of this.
However, with all of the other work going on this was way down on the
priority list but luckily someone other than ourselves picked this
up.
As usual the patch is tiny as all the infrastructure work had been
done multiple kernel releases ago. In addition to all the tests that
we already have I requested that Rodrigo add a dedicated tmpfs
testsuite for idmapped mounts to xfstests. It is to be included into
xfstests during the v6.3 development cycle. This should add a slew of
additional tests.
* tag 'fs.idmapped.v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (26 commits)
shmem: support idmapped mounts for tmpfs
fs: move mnt_idmap
fs: port vfs{g,u}id helpers to mnt_idmap
fs: port fs{g,u}id helpers to mnt_idmap
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
fs: port i_{g,u}id_{needs_}update() to mnt_idmap
quota: port to mnt_idmap
fs: port privilege checking helpers to mnt_idmap
fs: port inode_owner_or_capable() to mnt_idmap
fs: port inode_init_owner() to mnt_idmap
fs: port acl to mnt_idmap
fs: port xattr to mnt_idmap
fs: port ->permission() to pass mnt_idmap
fs: port ->fileattr_set() to pass mnt_idmap
fs: port ->set_acl() to pass mnt_idmap
fs: port ->get_acl() to pass mnt_idmap
fs: port ->tmpfile() to pass mnt_idmap
fs: port ->rename() to pass mnt_idmap
fs: port ->mknod() to pass mnt_idmap
fs: port ->mkdir() to pass mnt_idmap
...
|
|
This patch tests the bpf_fib_lookup helper when looking up
a neigh in NUD_FAILED and NUD_STALE state. It also adds test
for the new BPF_FIB_LOOKUP_SKIP_NEIGH flag.
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230217205515.3583372-2-martin.lau@linux.dev
|
|
This reverts commit 6c20822fada1b8adb77fa450d03a0d449686a4a9.
build bot failed on arch with different cache line size:
https://lore.kernel.org/bpf/50c35055-afa9-d01e-9a05-ea5351280e4f@intel.com/
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Add tests validating that it's possible to pass context arguments into
global subprogs for various types of programs, including a particularly
tricky KPROBE programs (which cover kprobes, uprobes, USDTs, a vast and
important class of programs).
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230216045954.3002473-4-andrii@kernel.org
|
|
Convert 17 test_global_funcs subtests into test_loader framework for
easier maintenance and more declarative way to define expected
failures/successes.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230216045954.3002473-3-andrii@kernel.org
|
|
Run spell checker on files in selftest/bpf and fixed typos.
Signed-off-by: Taichi Nishimura <awkrail01@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/bpf/20230216085537.519062-1-awkrail01@gmail.com
|
|
Use the new type-safe wrappers around bpf_obj_get_info_by_fd().
Fix a prog/map mixup in prog_holds_map().
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230214231221.249277-6-iii@linux.ibm.com
|
|
&xdp_buff and &xdp_frame are bound in a way that
xdp_buff->data_hard_start == xdp_frame
It's always the case and e.g. xdp_convert_buff_to_frame() relies on
this.
IOW, the following:
for (u32 i = 0; i < 0xdead; i++) {
xdpf = xdp_convert_buff_to_frame(&xdp);
xdp_convert_frame_to_buff(xdpf, &xdp);
}
shouldn't ever modify @xdpf's contents or the pointer itself.
However, "live packet" code wrongly treats &xdp_frame as part of its
context placed *before* the data_hard_start. With such flow,
data_hard_start is sizeof(*xdpf) off to the right and no longer points
to the XDP frame.
Instead of replacing `sizeof(ctx)` with `offsetof(ctx, xdpf)` in several
places and praying that there are no more miscalcs left somewhere in the
code, unionize ::frm with ::data in a flex array, so that both starts
pointing to the actual data_hard_start and the XDP frame actually starts
being a part of it, i.e. a part of the headroom, not the context.
A nice side effect is that the maximum frame size for this mode gets
increased by 40 bytes, as xdp_buff::frame_sz includes everything from
data_hard_start (-> includes xdpf already) to the end of XDP/skb shared
info.
Also update %MAX_PKT_SIZE accordingly in the selftests code. Leave it
hardcoded for 64 bit && 4k pages, it can be made more flexible later on.
Minor: align `&head->data` with how `head->frm` is assigned for
consistency.
Minor #2: rename 'frm' to 'frame' in &xdp_page_head while at it for
clarity.
(was found while testing XDP traffic generator on ice, which calls
xdp_convert_frame_to_buff() for each XDP frame)
Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN")
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230215185440.4126672-1-aleksander.lobakin@intel.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Add a new benchmark which measures hashmap lookup operations speed. A user can
control the following parameters of the benchmark:
* key_size (max 1024): the key size to use
* max_entries: the hashmap max entries
* nr_entries: the number of entries to insert/lookup
* nr_loops: the number of loops for the benchmark
* map_flags The hashmap flags passed to BPF_MAP_CREATE
The BPF program performing the benchmarks calls two nested bpf_loop:
bpf_loop(nr_loops/nr_entries)
bpf_loop(nr_entries)
bpf_map_lookup()
So the nr_loops determines the number of actual map lookups. All lookups are
successful.
Example (the output is generated on a AMD Ryzen 9 3950X machine):
for nr_entries in `seq 4096 4096 65536`; do echo -n "$((nr_entries*100/65536))% full: "; sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=4 --nr_entries=$nr_entries --max_entries=65536 --nr_loops=1000000 --map_flags=0x40 | grep cpu; done
6% full: cpu01: lookup 50.739M ± 0.018M events/sec (approximated from 32 samples of ~19ms)
12% full: cpu01: lookup 47.751M ± 0.015M events/sec (approximated from 32 samples of ~20ms)
18% full: cpu01: lookup 45.153M ± 0.013M events/sec (approximated from 32 samples of ~22ms)
25% full: cpu01: lookup 43.826M ± 0.014M events/sec (approximated from 32 samples of ~22ms)
31% full: cpu01: lookup 41.971M ± 0.012M events/sec (approximated from 32 samples of ~23ms)
37% full: cpu01: lookup 41.034M ± 0.015M events/sec (approximated from 32 samples of ~24ms)
43% full: cpu01: lookup 39.946M ± 0.012M events/sec (approximated from 32 samples of ~25ms)
50% full: cpu01: lookup 38.256M ± 0.014M events/sec (approximated from 32 samples of ~26ms)
56% full: cpu01: lookup 36.580M ± 0.018M events/sec (approximated from 32 samples of ~27ms)
62% full: cpu01: lookup 36.252M ± 0.012M events/sec (approximated from 32 samples of ~27ms)
68% full: cpu01: lookup 35.200M ± 0.012M events/sec (approximated from 32 samples of ~28ms)
75% full: cpu01: lookup 34.061M ± 0.009M events/sec (approximated from 32 samples of ~29ms)
81% full: cpu01: lookup 34.374M ± 0.010M events/sec (approximated from 32 samples of ~29ms)
87% full: cpu01: lookup 33.244M ± 0.011M events/sec (approximated from 32 samples of ~30ms)
93% full: cpu01: lookup 32.182M ± 0.013M events/sec (approximated from 32 samples of ~31ms)
100% full: cpu01: lookup 31.497M ± 0.016M events/sec (approximated from 32 samples of ~31ms)
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-8-aspsk@isovalent.com
|
|
The bench utility will print
Setting up benchmark '<bench-name>'...
Benchmark '<bench-name>' started.
on startup to stdout. Suppress this output if --quiet option if given. This
makes it simpler to parse benchmark output by a script.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-7-aspsk@isovalent.com
|
|
The "local-storage-tasks-trace" benchmark has a `--quiet` option. Move it to
the list of common options, so that the main code and other benchmarks can use
(new) env.quiet variable. Patch the run_bench_local_storage_rcu_tasks_trace.sh
helper script accordingly.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-6-aspsk@isovalent.com
|
|
The benchs/bench_bpf_hashmap_full_update.c doesn't set a custom argp,
so it shouldn't include the <argp.h> header.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-5-aspsk@isovalent.com
|
|
To parse command line the bench utility uses the argp_parse() function. This
function takes as an argument a parent 'struct argp' structure which defines
common command line options and an array of children 'struct argp' structures
which defines additional command line options for particular benchmarks. This
implementation doesn't allow benchmarks to share option names, e.g., if two
benchmarks want to use, say, the --option option, then only one of them will
succeed (the first one encountered in the array). This will be convenient if
same option names could be used in different benchmarks (with the same
semantics, e.g., --nr_loops=N).
Fix this by calling the argp_parse() function twice. The first call is the same
as it was before, with all children argps, and helps to find the benchmark name
and to print a combined help message if anything is wrong. Given the name, we
can call the argp_parse the second time, but now the children array points only
to a correct benchmark thus always calling the correct parsers. (If there's no
a specific list of arguments, then only one call to argp_parse will be done.)
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-4-aspsk@isovalent.com
|
|
The hashmap_report_final callback function defined in the
benchs/bench_bpf_hashmap_full_update.c file should be static.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-3-aspsk@isovalent.com
|
|
To call the bpf_hashmap_full_update benchmark, one should say:
bench bpf-hashmap-ful-update
The patch adds a missing 'l' to the benchmark name.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230213091519.1202813-2-aspsk@isovalent.com
|
|
The reinitialization of spin-lock in map value after immediate reuse may
corrupt lookup with BPF_F_LOCK flag and result in hard lock-up, so add
one test case to demonstrate the problem.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230215082132.3856544-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
A test case to verify that variable offset BPF_ST instruction
preserves STACK_ZERO marks when writes zeros, e.g. in the following
situation:
*(u64*)(r10 - 8) = 0 ; STACK_ZERO marks for fp[-8]
r0 = random(-7, -1) ; some random number in range of [-7, -1]
r0 += r10 ; r0 is now variable offset pointer to stack
*(u8*)(r0) = 0 ; BPF_ST writing zero, STACK_ZERO mark for
; fp[-8] should be preserved.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Check that verifier tracks the value of 'imm' spilled to stack by
BPF_ST_MEM instruction. Cover the following cases:
- write of non-zero constant to stack;
- write of a zero constant to stack.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
For aligned stack writes using BPF_ST instruction track stored values
in a same way BPF_STX is handled, e.g. make sure that the following
commands produce similar verifier knowledge:
fp[-8] = 42; r1 = 42;
fp[-8] = r1;
This covers two cases:
- non-null values written to stack are stored as spill of fake
registers;
- null values written to stack are stored as STACK_ZERO marks.
Previously both cases above used STACK_MISC marks instead.
Some verifier test cases relied on the old logic to obtain STACK_MISC
marks for some stack values. These test cases are updated in the same
commit to avoid failures during bisect.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230214232030.1502829-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The compiler is optimizing out majority of unref_ptr read/writes, so the test
wasn't testing much. For example, one could delete '__kptr' tag from
'struct prog_test_ref_kfunc __kptr *unref_ptr;' and the test would still "pass".
Convert it to volatile stores. Confirmed by comparing bpf asm before/after.
Fixes: 2cbc469a6fc3 ("selftests/bpf: Add C tests for kptr")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230214235051.22938-1-alexei.starovoitov@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
When the BPF selftests are cross-compiled, only the a host version of
bpftool is built. This version of bpftool is used on the host-side to
generate various intermediates, e.g., skeletons.
The test runners are also using bpftool, so the Makefile will symlink
bpftool from the selftest/bpf root, where the test runners will look
the tool:
| $(Q)ln -sf $(if $2,..,.)/tools/build/bpftool/bootstrap/bpftool \
| $(OUTPUT)/$(if $2,$2/)bpftool
There are two problems for cross-compilation builds:
1. There is no native (cross-compilation target) of bpftool
2. The bootstrap/bpftool is never cross-compiled (by design)
Make sure that a native/cross-compiled version of bpftool is built,
and if CROSS_COMPILE is set, symlink the native/non-bootstrap version.
Acked-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20230214161253.183458-1-bjorn@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Clean up prog_tests/dynptr.c by removing the unneeded "expected_err_msg"
in the dynptr_tests struct, which is a remnant from converting the fail
tests cases to use the generic verification tester.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20230214051332.4007131-2-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Clean up user_ringbuf, cgrp_kfunc, and kfunc_dynptr_param tests to use
the generic verification tester for checking verifier rejections.
The generic verification tester uses btf_decl_tag-based annotations
for verifying that the tests fail with the expected log messages.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Link: https://lore.kernel.org/r/20230214051332.4007131-1-joannelkoong@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch adds selftests exercising the logic changed/added in the
previous patches in the series. A variety of successful and unsuccessful
rbtree usages are validated:
Success:
* Add some nodes, let map_value bpf_rbtree_root destructor clean them
up
* Add some nodes, remove one using the non-owning ref leftover by
successful rbtree_add() call
* Add some nodes, remove one using the non-owning ref returned by
rbtree_first() call
Failure:
* BTF where bpf_rb_root owns bpf_list_node should fail to load
* BTF where node of type X is added to tree containing nodes of type Y
should fail to load
* No calling rbtree api functions in 'less' callback for rbtree_add
* No releasing lock in 'less' callback for rbtree_add
* No removing a node which hasn't been added to any tree
* No adding a node which has already been added to a tree
* No escaping of non-owning references past their lock's
critical section
* No escaping of non-owning references past other invalidation points
(rbtree_remove)
These tests mostly focus on rbtree-specific additions, but some of the
failure cases revalidate scenarios common to both linked_list and rbtree
which are covered in the former's tests. Better to be a bit redundant in
case linked_list and rbtree semantics deviate over time.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-8-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
These kfuncs will be used by selftests in following patches
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-7-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties
that require handling in the verifier:
* both bpf_rbtree_remove and bpf_rbtree_first return the type containing
the bpf_rb_node field, with the offset set to that field's offset,
instead of a struct bpf_rb_node *
* mark_reg_graph_node helper added in previous patch generalizes
this logic, use it
* bpf_rbtree_remove's node input is a node that's been inserted
in the tree - a non-owning reference.
* bpf_rbtree_remove must invalidate non-owning references in order to
avoid aliasing issue. Use previously-added
invalidate_non_owning_refs helper to mark this function as a
non-owning ref invalidation point.
* Unlike other functions, which convert one of their input arg regs to
non-owning reference, bpf_rbtree_first takes no arguments and just
returns a non-owning reference (possibly null)
* For now verifier logic for this is special-cased instead of
adding new kfunc flag.
This patch, along with the previous one, complete special verifier
handling for all rbtree API functions added in this series.
With functional verifier handling of rbtree_remove, under current
non-owning reference scheme, a node type with both bpf_{list,rb}_node
fields could cause the verifier to accept programs which remove such
nodes from collections they haven't been added to.
In order to prevent this, this patch adds a check to btf_parse_fields
which rejects structs with both bpf_{list,rb}_node fields. This is a
temporary measure that can be removed after "collection identity"
followup. See comment added in btf_parse_fields. A linked_list BTF test
exercising the new check is added in this patch as well.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
map_values.
structs bpf_rb_root and bpf_rb_node are opaque types meant to
obscure structs rb_root_cached rb_node, respectively.
btf_struct_access will prevent BPF programs from touching these special
fields automatically now that they're recognized.
btf_check_and_fixup_fields now groups list_head and rb_root together as
"graph root" fields and {list,rb}_node as "graph node", and does same
ownership cycle checking as before. Note that this function does _not_
prevent ownership type mixups (e.g. rb_root owning list_node) - that's
handled by btf_parse_graph_root.
After this patch, a bpf program can have a struct bpf_rb_root in a
map_value, but not add anything to nor do anything useful with it.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch introduces non-owning reference semantics to the verifier,
specifically linked_list API kfunc handling. release_on_unlock logic for
refs is refactored - with small functional changes - to implement these
semantics, and bpf_list_push_{front,back} are migrated to use them.
When a list node is pushed to a list, the program still has a pointer to
the node:
n = bpf_obj_new(typeof(*n));
bpf_spin_lock(&l);
bpf_list_push_back(&l, n);
/* n still points to the just-added node */
bpf_spin_unlock(&l);
What the verifier considers n to be after the push, and thus what can be
done with n, are changed by this patch.
Common properties both before/after this patch:
* After push, n is only a valid reference to the node until end of
critical section
* After push, n cannot be pushed to any list
* After push, the program can read the node's fields using n
Before:
* After push, n retains the ref_obj_id which it received on
bpf_obj_new, but the associated bpf_reference_state's
release_on_unlock field is set to true
* release_on_unlock field and associated logic is used to implement
"n is only a valid ref until end of critical section"
* After push, n cannot be written to, the node must be removed from
the list before writing to its fields
* After push, n is marked PTR_UNTRUSTED
After:
* After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF
type flag is added to reg's type, indicating that it's a non-owning
reference.
* NON_OWN_REF flag and logic is used to implement "n is only a
valid ref until end of critical section"
* n can be written to (except for special fields e.g. bpf_list_node,
timer, ...)
Summary of specific implementation changes to achieve the above:
* release_on_unlock field, ref_set_release_on_unlock helper, and logic
to "release on unlock" based on that field are removed
* The anonymous active_lock struct used by bpf_verifier_state is
pulled out into a named struct bpf_active_lock.
* NON_OWN_REF type flag is introduced along with verifier logic
changes to handle non-owning refs
* Helpers are added to use NON_OWN_REF flag to implement non-owning
ref semantics as described above
* invalidate_non_owning_refs - helper to clobber all non-owning refs
matching a particular bpf_active_lock identity. Replaces
release_on_unlock logic in process_spin_lock.
* ref_set_non_owning - set NON_OWN_REF type flag after doing some
sanity checking
* ref_convert_owning_non_owning - convert owning reference w/
specified ref_obj_id to non-owning references. Set NON_OWN_REF
flag for each reg with that ref_obj_id and 0-out its ref_obj_id
* Update linked_list selftests to account for minor semantic
differences introduced by this patch
* Writes to a release_on_unlock node ref are not allowed, while
writes to non-owning reference pointees are. As a result the
linked_list "write after push" failure tests are no longer scenarios
that should fail.
* The test##missing_lock##op and test##incorrect_lock##op
macro-generated failure tests need to have a valid node argument in
order to have the same error output as before. Otherwise
verification will fail early and the expected error output won't be seen.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Building BPF selftests out of srctree fails with:
make: *** No rule to make target '/linux-build//ima_setup.sh', needed by 'ima_setup.sh'. Stop.
The culprit is the rule that defines convenient shorthands like
"make test_progs", which builds $(OUTPUT)/test_progs. These shorthands
make sense only for binaries that are built though; scripts that live
in the source tree do not end up in $(OUTPUT).
Therefore drop $(TEST_PROGS) and $(TEST_PROGS_EXTENDED) from the rule.
The issue exists for a while, but it became a problem only after commit
d68ae4982cb7 ("selftests/bpf: Install all required files to run selftests"),
which added dependencies on these scripts.
Fixes: 03dcb78460c2 ("selftests/bpf: Add simple per-test targets to Makefile")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20230208231211.283606-1-iii@linux.ibm.com
|
|
====================
pull-request: bpf-next 2023-02-11
We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).
There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit 5b246e533d01 ("ice: split probe into smaller functions")
from the net-next tree and commit 66c0e13ad236 ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:
[...]
ice_set_netdev_features(netdev);
netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
NETDEV_XDP_ACT_XSK_ZEROCOPY;
ice_set_ops(netdev);
[...]
Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/
The main changes are:
1) Add support for BPF trampoline on s390x which finally allows to remove many
test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.
2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.
3) Add capability to export the XDP features supported by the NIC.
Along with that, add a XDP compliance test tool,
from Lorenzo Bianconi & Marek Majtyka.
4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
from David Vernet.
5) Add a deep dive documentation about the verifier's register
liveness tracking algorithm, from Eduard Zingerman.
6) Fix and follow-up cleanups for resolve_btfids to be compiled
as a host program to avoid cross compile issues,
from Jiri Olsa & Ian Rogers.
7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
when testing on different NICs, from Jesper Dangaard Brouer.
8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.
9) Extend libbpf to add an option for when the perf buffer should
wake up, from Jon Doron.
10) Follow-up fix on xdp_metadata selftest to just consume on TX
completion, from Stanislav Fomichev.
11) Extend the kfuncs.rst document with description on kfunc
lifecycle & stability expectations, from David Vernet.
12) Fix bpftool prog profile to skip attaching to offline CPUs,
from Tonghao Zhang.
====================
Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
malloc() and free() may be completely replaced by sanitizers, use
fopen() and fclose() instead.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-7-iii@linux.ibm.com
|
|
malloc() and free() may be completely replaced by sanitizers, use
fopen() and fclose() instead.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-6-iii@linux.ibm.com
|
|
To get useful results from the Memory Sanitizer, all code running in a
process needs to be instrumented. When building tests with other
sanitizers, it's not strictly necessary, but is also helpful.
So make sure runqslower and libbpf are compiled with SAN_CFLAGS and
linked with SAN_LDFLAGS.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-5-iii@linux.ibm.com
|
|
Memory Sanitizer requires passing different options to CFLAGS and
LDFLAGS: besides the mandatory -fsanitize=memory, one needs to pass
header and library paths, and passing -L to a compilation step
triggers -Wunused-command-line-argument. So introduce a separate
variable for linker flags. Use $(SAN_CFLAGS) as a default in order to
avoid complicating the ASan usage.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-4-iii@linux.ibm.com
|
|
Using HOSTCC="ccache clang" breaks building the tests, since, when it's
forwarded to e.g. bpftool, the child make sees HOSTCC=ccache and
"clang" is considered a target. Fix by quoting it, and also HOSTLD and
HOSTAR for consistency.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230210001210.395194-2-iii@linux.ibm.com
|
|
There is a spelling mistake in a literal string. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20230206092229.46416-1-colin.i.king@gmail.com
|
|
Introduce xdp_features tool in order to test XDP features supported by
the NIC and match them against advertised ones.
In order to test supported/advertised XDP features, xdp_features must
run on the Device Under Test (DUT) and on a Tester device.
xdp_features opens a control TCP channel between DUT and Tester devices
to send control commands from Tester to the DUT and a UDP data channel
where the Tester sends UDP 'echo' packets and the DUT is expected to
reply back with the same packet. DUT installs multiple XDP programs on the
NIC to test XDP capabilities and reports back to the Tester some XDP stats.
Currently xdp_features supports the following XDP features:
- XDP_DROP
- XDP_ABORTED
- XDP_PASS
- XDP_TX
- XDP_REDIRECT
- XDP_NDO_XMIT
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/7c1af8e7e6ef0614cf32fa9e6bdaa2d8d605f859.1675245258.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|