diff options
Diffstat (limited to 'Documentation/networking/af_xdp.rst')
| -rw-r--r-- | Documentation/networking/af_xdp.rst | 272 |
1 files changed, 248 insertions, 24 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 60b217b436be..50d92084a49c 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -209,13 +209,10 @@ Libbpf Libbpf is a helper library for eBPF and XDP that makes using these technologies a lot simpler. It also contains specific helper functions -in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It -contains two types of functions: those that can be used to make the -setup of AF_XDP socket easier and ones that can be used in the data -plane to access the rings safely and quickly. To see an example on how -to use this API, please take a look at the sample application in -samples/bpf/xdpsock_usr.c which uses libbpf for both setup and data -plane operations. +in tools/testing/selftests/bpf/xsk.h for facilitating the use of +AF_XDP. It contains two types of functions: those that can be used to +make the setup of AF_XDP socket easier and ones that can be used in the +data plane to access the rings safely and quickly. We recommend that you use this library unless you have become a power user. It will make your program a lot simpler. @@ -372,9 +369,8 @@ needs to explicitly notify the kernel to send any packets put on the TX ring. This can be accomplished either by a poll() call, as in the RX path, or by calling sendto(). -An example of how to use this flag can be found in -samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers -would look like this for the TX path: +An example with the use of libbpf helpers would look like this for the +TX path: .. code-block:: c @@ -419,7 +415,7 @@ XDP_UMEM_REG setsockopt ----------------------- This setsockopt registers a UMEM to a socket. This is the area that -contain all the buffers that packet can recide in. The call takes a +contain all the buffers that packet can reside in. The call takes a pointer to the beginning of this area and the size of it. Moreover, it also has parameter called chunk_size that is the size that the UMEM is divided into. It can only be 2K or 4K at the moment. If you have an @@ -433,6 +429,24 @@ start N bytes into the buffer leaving the first N bytes for the application to use. The final option is the flags field, but it will be dealt with in separate sections for each UMEM flag. +SO_BINDTODEVICE setsockopt +-------------------------- + +This is a generic SOL_SOCKET option that can be used to tie AF_XDP +socket to a particular network interface. It is useful when a socket +is created by a privileged process and passed to a non-privileged one. +Once the option is set, kernel will refuse attempts to bind that socket +to a different interface. Updating the value requires CAP_NET_RAW. + +XDP_MAX_TX_SKB_BUDGET setsockopt +-------------------------------- + +This setsockopt sets the maximum number of descriptors that can be handled +and passed to the driver at one send syscall. It is applied in the copy +mode to allow application to tune the per-socket maximum iteration for +better throughput and less frequency of send syscall. +Allowed range is [32, xs->tx->nentries]. + XDP_STATISTICS getsockopt ------------------------- @@ -453,15 +467,99 @@ XDP_OPTIONS getsockopt Gets options from an XDP socket. The only one supported so far is XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not. +Multi-Buffer Support +==================== + +With multi-buffer support, programs using AF_XDP sockets can receive +and transmit packets consisting of multiple buffers both in copy and +zero-copy mode. For example, a packet can consist of two +frames/buffers, one with the header and the other one with the data, +or a 9K Ethernet jumbo frame can be constructed by chaining together +three 4K frames. + +Some definitions: + +* A packet consists of one or more frames + +* A descriptor in one of the AF_XDP rings always refers to a single + frame. In the case the packet consists of a single frame, the + descriptor refers to the whole packet. + +To enable multi-buffer support for an AF_XDP socket, use the new bind +flag XDP_USE_SG. If this is not provided, all multi-buffer packets +will be dropped just as before. Note that the XDP program loaded also +needs to be in multi-buffer mode. This can be accomplished by using +"xdp.frags" as the section name of the XDP program used. + +To represent a packet consisting of multiple frames, a new flag called +XDP_PKT_CONTD is introduced in the options field of the Rx and Tx +descriptors. If it is true (1) the packet continues with the next +descriptor and if it is false (0) it means this is the last descriptor +of the packet. Why the reverse logic of end-of-packet (eop) flag found +in many NICs? Just to preserve compatibility with non-multi-buffer +applications that have this bit set to false for all packets on Rx, +and the apps set the options field to zero for Tx, as anything else +will be treated as an invalid descriptor. + +These are the semantics for producing packets onto AF_XDP Tx ring +consisting of multiple frames: + +* When an invalid descriptor is found, all the other + descriptors/frames of this packet are marked as invalid and not + completed. The next descriptor is treated as the start of a new + packet, even if this was not the intent (because we cannot guess + the intent). As before, if your program is producing invalid + descriptors you have a bug that must be fixed. + +* Zero length descriptors are treated as invalid descriptors. + +* For copy mode, the maximum supported number of frames in a packet is + equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all + descriptors accumulated so far are dropped and treated as + invalid. To produce an application that will work on any system + regardless of this config setting, limit the number of frags to 18, + as the minimum value of the config is 17. + +* For zero-copy mode, the limit is up to what the NIC HW + supports. Usually at least five on the NICs we have checked. We + consciously chose to not enforce a rigid limit (such as + CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have + resulted in copy actions under the hood to fit into what limit the + NIC supports. Kind of defeats the purpose of zero-copy mode. How to + probe for this limit is explained in the "probe for multi-buffer + support" section. + +On the Rx path in copy-mode, the xsk core copies the XDP data into +multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as +detailed before. Zero-copy mode works the same, though the data is not +copied. When the application gets a descriptor with the XDP_PKT_CONTD +flag set to one, it means that the packet consists of multiple buffers +and it continues with the next buffer in the following +descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it +means that this is the last buffer of the packet. AF_XDP guarantees +that only a complete packet (all frames in the packet) is sent to the +application. If there is not enough space in the AF_XDP Rx ring, all +frames of the packet will be dropped. + +If application reads a batch of descriptors, using for example the libxdp +interfaces, it is not guaranteed that the batch will end with a full +packet. It might end in the middle of a packet and the rest of the +buffers of that packet will arrive at the beginning of the next batch, +since the libxdp interface does not read the whole ring (unless you +have an enormous batch size or a very small ring size). + +An example program each for Rx and Tx multi-buffer support can be found +later in this document. + Usage -===== +----- -In order to use AF_XDP sockets two parts are needed. The -user-space application and the XDP program. For a complete setup and -usage example, please refer to the sample application. The user-space -side is xdpsock_user.c and the XDP side is part of libbpf. +In order to use AF_XDP sockets two parts are needed. The user-space +application and the XDP program. For a complete setup and usage example, +please refer to the xdp-project at +https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example. -The XDP code sample included in tools/lib/bpf/xsk.c is the following: +The XDP code sample is the following: .. code-block:: c @@ -532,13 +630,139 @@ like this: But please use the libbpf functions as they are optimized and ready to use. Will make your life easier. +Usage Multi-Buffer Rx +--------------------- + +Here is a simple Rx path pseudo-code example (using libxdp interfaces +for simplicity). Error paths have been excluded to keep it short: + +.. code-block:: c + + void rx_packets(struct xsk_socket_info *xsk) + { + static bool new_packet = true; + u32 idx_rx = 0, idx_fq = 0; + static char *pkt; + + int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx); + + xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq); + + for (int i = 0; i < rcvd; i++) { + struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++); + char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr); + bool eop = !(desc->options & XDP_PKT_CONTD); + + if (new_packet) + pkt = frag; + else + add_frag_to_pkt(pkt, frag); + + if (eop) + process_pkt(pkt); + + new_packet = eop; + + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr; + } + + xsk_ring_prod__submit(&xsk->umem->fq, rcvd); + xsk_ring_cons__release(&xsk->rx, rcvd); + } + +Usage Multi-Buffer Tx +--------------------- + +Here is an example Tx path pseudo-code (using libxdp interfaces for +simplicity) ignoring that the umem is finite in size, and that we +eventually will run out of packets to send. Also assumes pkts.addr +points to a valid location in the umem. + +.. code-block:: c + + void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts, + int batch_size) + { + u32 idx, i, pkt_nb = 0; + + xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx); + + for (i = 0; i < batch_size;) { + u64 addr = pkts[pkt_nb].addr; + u32 len = pkts[pkt_nb].size; + + do { + struct xdp_desc *tx_desc; + + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++); + tx_desc->addr = addr; + + if (len > xsk_frame_size) { + tx_desc->len = xsk_frame_size; + tx_desc->options = XDP_PKT_CONTD; + } else { + tx_desc->len = len; + tx_desc->options = 0; + pkt_nb++; + } + len -= tx_desc->len; + addr += xsk_frame_size; + + if (i == batch_size) { + /* Remember len, addr, pkt_nb for next iteration. + * Skipped for simplicity. + */ + break; + } + } while (len); + } + + xsk_ring_prod__submit(&xsk->tx, i); + } + +Probing for Multi-Buffer Support +-------------------------------- + +To discover if a driver supports multi-buffer AF_XDP in SKB or DRV +mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to +query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for +querying for XDP multi-buffer support. If XDP supports multi-buffer in +a driver, then AF_XDP will also support that in SKB and DRV mode. + +To discover if a driver supports multi-buffer AF_XDP in zero-copy +mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY +flag. If it is set, it means that at least zero-copy is supported and +you should go and check the netlink attribute +NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer +value will be returned stating the max number of frags that are +supported by this device in zero-copy mode. These are the possible +return values: + +1: Multi-buffer for zero-copy is not supported by this device, as max + one fragment supported means that multi-buffer is not possible. + +>=2: Multi-buffer is supported in zero-copy mode for this device. The + returned number signifies the max number of frags supported. + +For an example on how these are used through libbpf, please take a +look at tools/testing/selftests/bpf/xskxceiver.c. + +Multi-Buffer Support for Zero-Copy Drivers +------------------------------------------ + +Zero-copy drivers usually use the batched APIs for Rx and Tx +processing. Note that the Tx batch API guarantees that it will provide +a batch of Tx descriptors that ends with full packet at the end. This +to facilitate extending a zero-copy driver with multi-buffer support. + Sample application ================== - -There is a xdpsock benchmarking/test application included that -demonstrates how to use AF_XDP sockets with private UMEMs. Say that -you would like your UDP traffic from port 4242 to end up in queue 16, -that we will enable AF_XDP on. Here, we use ethtool for this:: +There is a xdpsock benchmarking/test application that can be found at +https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example +that demonstrates how to use AF_XDP sockets with private +UMEMs. Say that you would like your UDP traffic from port 4242 to end +up in queue 16, that we will enable AF_XDP on. Here, we use ethtool +for this:: ethtool -N p3p2 rx-flow-hash udp4 fn ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ @@ -555,7 +779,7 @@ can be displayed with "-h", as usual. This sample application uses libbpf to make the setup and usage of AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is really used to make something more advanced, take a look at the libbpf -code in tools/lib/bpf/xsk.[ch]. +code in tools/testing/selftests/bpf/xsk.[ch]. FAQ ======= @@ -592,7 +816,7 @@ A: When a netdev of a physical NIC is initialized, Linux usually A number of other ways are possible all up to the capabilities of the NIC you have. -Q: Can I use the XSKMAP to implement a switch betwen different umems +Q: Can I use the XSKMAP to implement a switch between different umems in copy mode? A: The short answer is no, that is not supported at the moment. The |
