1 files changed, 303 insertions, 32 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 5bc55a4e3bce..72da7057e4cf 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -243,8 +243,8 @@ Configuration Flags and Socket Options
 These are the various configuration flags that can be used to control
 and monitor the behavior of AF_XDP sockets.
 
-XDP_COPY and XDP_ZERO_COPY bind flags
--------------------------------------
+XDP_COPY and XDP_ZEROCOPY bind flags
+------------------------------------
 
 When you bind to a socket, the kernel will first try to use zero-copy
 copy. If zero-copy is not supported, it will fall back on using copy
@@ -252,20 +252,27 @@ mode, i.e. copying all packets out to user space. But if you would
 like to force a certain mode, you can use the following flags. If you
 pass the XDP_COPY flag to the bind call, the kernel will force the
 socket into copy mode. If it cannot use copy mode, the bind call will
-fail with an error. Conversely, the XDP_ZERO_COPY flag will force the
+fail with an error. Conversely, the XDP_ZEROCOPY flag will force the
 socket into zero-copy mode or fail.
 
 XDP_SHARED_UMEM bind flag
 -------------------------
 
-This flag enables you to bind multiple sockets to the same UMEM, but
-only if they share the same queue id. In this mode, each socket has
-their own RX and TX rings, but the UMEM (tied to the fist socket
-created) only has a single FILL ring and a single COMPLETION
-ring. To use this mode, create the first socket and bind it in the normal
-way. Create a second socket and create an RX and a TX ring, or at
-least one of them, but no FILL or COMPLETION rings as the ones from
-the first socket will be used. In the bind call, set he
+This flag enables you to bind multiple sockets to the same UMEM. It
+works on the same queue id, between queue ids and between
+netdevs/devices. In this mode, each socket has its own RX and TX
+rings as usual, but you are going to have one or more FILL and
+COMPLETION ring pairs. You have to create one of these pairs per
+unique netdev and queue id tuple that you bind to.
+
+We start with the case where we would like to share a UMEM between
+sockets bound to the same netdev and queue id. The UMEM (tied to the
+first socket created) will only have a single FILL ring and a single
+COMPLETION ring as there is only one unique netdev,queue_id tuple that
+we have bound to. To use this mode, create the first socket and bind
+it in the normal way. Create a second socket and create an RX and a TX
+ring, or at least one of them, but no FILL or COMPLETION rings as the
+ones from the first socket will be used. In the bind call, set the
 XDP_SHARED_UMEM option and provide the initial socket's fd in the
 sxdp_shared_umem_fd field. You can attach an arbitrary number of extra
 sockets this way.
@@ -283,19 +290,19 @@ round-robin example of distributing packets is shown below:
    #define MAX_SOCKS 16
 
    struct {
-        __uint(type, BPF_MAP_TYPE_XSKMAP);
-        __uint(max_entries, MAX_SOCKS);
-        __uint(key_size, sizeof(int));
-        __uint(value_size, sizeof(int));
+       __uint(type, BPF_MAP_TYPE_XSKMAP);
+       __uint(max_entries, MAX_SOCKS);
+       __uint(key_size, sizeof(int));
+       __uint(value_size, sizeof(int));
    } xsks_map SEC(".maps");
 
    static unsigned int rr;
 
    SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
    {
-	rr = (rr + 1) & (MAX_SOCKS - 1);
+       rr = (rr + 1) & (MAX_SOCKS - 1);
 
-	return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
+       return bpf_redirect_map(&xsks_map, rr, XDP_DROP);
    }
 
 Note, that since there is only a single set of FILL and COMPLETION
@@ -305,11 +312,42 @@ concurrently. There are no synchronization primitives in the
 libbpf code that protects multiple users at this point in time.
 
 Libbpf uses this mode if you create more than one socket tied to the
-same umem. However, note that you need to supply the
+same UMEM. However, note that you need to supply the
 XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the
 xsk_socket__create calls and load your own XDP program as there is no
 built in one in libbpf that will route the traffic for you.
 
+The second case is when you share a UMEM between sockets that are
+bound to different queue ids and/or netdevs. In this case you have to
+create one FILL ring and one COMPLETION ring for each unique
+netdev,queue_id pair. Let us say you want to create two sockets bound
+to two different queue ids on the same netdev. Create the first socket
+and bind it in the normal way. Create a second socket and create an RX
+and a TX ring, or at least one of them, and then one FILL and
+COMPLETION ring for this socket. Then in the bind call, set the
+XDP_SHARED_UMEM option and provide the initial socket's fd in the
+sxdp_shared_umem_fd field as you registered the UMEM on that
+socket. These two sockets will now share one and the same UMEM.
+
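The bind step described above can be sketched as follows. This is a minimal sketch, not a complete setup: the helper names `fill_shared_umem_addr()` and `bind_shared()` are hypothetical, the fallback value for `AF_XDP` is only needed on older C libraries, and creating the sockets and rings (including the extra FILL and COMPLETION rings needed for a new netdev,queue_id pair) is assumed to have happened already.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44 /* address family number used by the kernel */
#endif

/* Hypothetical helper: fill the bind address for a socket that shares
 * the UMEM registered on the socket referred to by first_fd. */
static void fill_shared_umem_addr(struct sockaddr_xdp *sxdp,
                                  unsigned int ifindex,
                                  unsigned int queue_id, int first_fd)
{
    memset(sxdp, 0, sizeof(*sxdp));
    sxdp->sxdp_family = AF_XDP;
    sxdp->sxdp_ifindex = ifindex;
    sxdp->sxdp_queue_id = queue_id;
    sxdp->sxdp_flags = XDP_SHARED_UMEM;
    sxdp->sxdp_shared_umem_fd = first_fd;
}

/* Hypothetical helper: bind second_fd so that it shares the UMEM of
 * first_fd. Returns whatever bind() returns (0 or -1 with errno set). */
static int bind_shared(int second_fd, unsigned int ifindex,
                       unsigned int queue_id, int first_fd)
{
    struct sockaddr_xdp sxdp;

    fill_shared_umem_addr(&sxdp, ifindex, queue_id, first_fd);
    return bind(second_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}
```

The same address layout applies whether the second socket uses the same queue id as the first one or a different one; only the ring setup before the bind call differs.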
+In this case, it is possible to use the NIC's packet steering
+capabilities to steer the packets to the right queue. This is not
+possible in the previous example as there is only one queue shared
+among sockets, so the NIC cannot do this steering as it can only steer
+between queues.
+
+In libxdp (or libbpf prior to version 1.0), you need to use the
+xsk_socket__create_shared() API as it takes a reference to a FILL ring
+and a COMPLETION ring that will be created for you and bound to the
+shared UMEM. You can use this function for all the sockets you create,
+or you can use it for the second and following ones and use
+xsk_socket__create() for the first one. Both methods yield the same
+result.
+
+Note that a UMEM can be shared between sockets on the same queue id
+and device, as well as between queues on the same device and between
+devices at the same time. It is also possible to redirect to any
+socket as long as it is bound to the same UMEM with XDP_SHARED_UMEM.
+
 XDP_USE_NEED_WAKEUP bind flag
 -----------------------------
 
@@ -342,7 +380,7 @@ would look like this for the TX path:
 .. code-block:: c
 
    if (xsk_ring_prod__needs_wakeup(&my_tx_ring))
-      sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
+       sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0);
 
 I.e., only use the syscall if the flag is set.
 
@@ -364,7 +402,7 @@ resources by only setting up one of them. Both the FILL ring and the
 COMPLETION ring are mandatory as you need to have a UMEM tied to your
 socket. But if the XDP_SHARED_UMEM flag is used, any socket after the
 first one does not have a UMEM and should in that case not have any
-FILL or COMPLETION rings created as the ones from the shared umem will
+FILL or COMPLETION rings created as the ones from the shared UMEM will
 be used. Note, that the rings are single-producer single-consumer, so
 do not try to access them from multiple processes at the same
 time. See the XDP_SHARED_UMEM section.
@@ -382,7 +420,7 @@ XDP_UMEM_REG setsockopt
 -----------------------
 
 This setsockopt registers a UMEM to a socket. This is the area that
-contain all the buffers that packet can recide in. The call takes a
+contains all the buffers that packets can reside in. The call takes a
 pointer to the beginning of this area and the size of it. Moreover, it
 also has parameter called chunk_size that is the size that the UMEM is
 divided into. It can only be 2K or 4K at the moment. If you have an
@@ -396,6 +434,15 @@ start N bytes into the buffer leaving the first N bytes for the
 application to use. The final option is the flags field, but it will
 be dealt with in separate sections for each UMEM flag.
 
+SO_BINDTODEVICE setsockopt
+--------------------------
+
+This is a generic SOL_SOCKET option that can be used to tie an AF_XDP
+socket to a particular network interface. It is useful when a socket
+is created by a privileged process and passed to a non-privileged one.
+Once the option is set, the kernel will refuse attempts to bind that
+socket to a different interface. Updating the value requires CAP_NET_RAW.
+
 XDP_STATISTICS getsockopt
 -------------------------
 
@@ -405,9 +452,9 @@ purposes. The supported statistics are shown below:
 .. code-block:: c
 
    struct xdp_statistics {
-	  __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
-	  __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
-	  __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+       __u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+       __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+       __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
    };
 
 XDP_OPTIONS getsockopt
@@ -416,8 +463,92 @@ XDP_OPTIONS getsockopt
 Gets options from an XDP socket. The only one supported so far is
 XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not.
 
+Multi-Buffer Support
+====================
+
+With multi-buffer support, programs using AF_XDP sockets can receive
+and transmit packets consisting of multiple buffers both in copy and
+zero-copy mode. For example, a packet can consist of two
+frames/buffers, one with the header and the other one with the data,
+or a 9K Ethernet jumbo frame can be constructed by chaining together
+three 4K frames.
+
+Some definitions:
+
+* A packet consists of one or more frames
+
+* A descriptor in one of the AF_XDP rings always refers to a single
+  frame. In the case the packet consists of a single frame, the
+  descriptor refers to the whole packet.
+
+To enable multi-buffer support for an AF_XDP socket, use the new bind
+flag XDP_USE_SG. If this is not provided, all multi-buffer packets
+will be dropped just as before. Note that the XDP program loaded also
+needs to be in multi-buffer mode. This can be accomplished by using
+"xdp.frags" as the section name of the XDP program used.
+
+To represent a packet consisting of multiple frames, a new flag called
+XDP_PKT_CONTD is introduced in the options field of the Rx and Tx
+descriptors. If it is true (1) the packet continues with the next
+descriptor and if it is false (0) it means this is the last descriptor
+of the packet. Why use the reverse logic of the end-of-packet (eop)
+flag found in many NICs? To preserve compatibility with
+non-multi-buffer applications, which have this bit set to false for
+all packets on Rx and set the options field to zero for Tx, as
+anything else will be treated as an invalid descriptor.
+
+These are the semantics for producing packets onto the AF_XDP Tx ring
+consisting of multiple frames:
+
+* When an invalid descriptor is found, all the other
+  descriptors/frames of this packet are marked as invalid and not
+  completed. The next descriptor is treated as the start of a new
+  packet, even if this was not the intent (because we cannot guess
+  the intent). As before, if your program is producing invalid
+  descriptors you have a bug that must be fixed.
+
+* Zero length descriptors are treated as invalid descriptors.
+
+* For copy mode, the maximum supported number of frames in a packet is
+  equal to CONFIG_MAX_SKB_FRAGS + 1. If it is exceeded, all
+  descriptors accumulated so far are dropped and treated as
+  invalid. To produce an application that will work on any system
+  regardless of this config setting, limit the number of frags to 18,
+  as the minimum value of the config is 17.
+
+* For zero-copy mode, the limit is up to what the NIC HW
+  supports. Usually at least five on the NICs we have checked. We
+  consciously chose to not enforce a rigid limit (such as
+  CONFIG_MAX_SKB_FRAGS + 1) for zero-copy mode, as it would have
+  resulted in copy actions under the hood to fit into whatever limit
+  the NIC supports, which defeats the purpose of zero-copy mode. How
+  to probe for this limit is explained in the "Probing for
+  Multi-Buffer Support" section.
+
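The frame-count limits in the list above reduce to simple arithmetic. The sketch below assumes every frame except the last is filled up to a fixed `frame_size` (headroom is ignored), and the names `frames_needed()`, `fits_portable_copy_mode()` and `PORTABLE_MAX_FRAGS` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable copy-mode limit quoted in the text above (frames per packet). */
#define PORTABLE_MAX_FRAGS 18u

/* Number of descriptors needed to carry pkt_len bytes when each frame
 * holds at most frame_size bytes. Returns 0 for invalid input, since
 * zero-length descriptors are treated as invalid. */
static uint32_t frames_needed(uint32_t pkt_len, uint32_t frame_size)
{
    if (pkt_len == 0 || frame_size == 0)
        return 0;

    /* Ceiling division written so that it cannot overflow. */
    return (pkt_len - 1) / frame_size + 1;
}

/* True if the packet stays within the portable copy-mode frame limit. */
static bool fits_portable_copy_mode(uint32_t pkt_len, uint32_t frame_size)
{
    uint32_t n = frames_needed(pkt_len, frame_size);

    return n >= 1 && n <= PORTABLE_MAX_FRAGS;
}
```

For zero-copy mode the same check would use the limit reported by the device instead of the fixed constant.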
+On the Rx path in copy-mode, the xsk core copies the XDP data into
+multiple descriptors, if needed, and sets the XDP_PKT_CONTD flag as
+detailed before. Zero-copy mode works the same, though the data is not
+copied. When the application gets a descriptor with the XDP_PKT_CONTD
+flag set to one, it means that the packet consists of multiple buffers
+and it continues with the next buffer in the following
+descriptor. When a descriptor with XDP_PKT_CONTD == 0 is received, it
+means that this is the last buffer of the packet. AF_XDP guarantees
+that only a complete packet (all frames in the packet) is sent to the
+application. If there is not enough space in the AF_XDP Rx ring, all
+frames of the packet will be dropped.
+
+If an application reads a batch of descriptors, using for example the libxdp
+interfaces, it is not guaranteed that the batch will end with a full
+packet. It might end in the middle of a packet and the rest of the
+buffers of that packet will arrive at the beginning of the next batch,
+since the libxdp interface does not read the whole ring (unless you
+have an enormous batch size or a very small ring size).
+
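Because a batch may end in the middle of a packet, an application has to carry its reassembly state from one batch to the next. The sketch below shows one way to do that. `struct desc` is a local mirror of the descriptor layout (address, length, options), and `struct rx_state` and `consume_batch()` are hypothetical names; the `XDP_PKT_CONTD` fallback is only used if the kernel header has not defined it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef XDP_PKT_CONTD
#define XDP_PKT_CONTD (1 << 0) /* mirrors the definition in linux/if_xdp.h */
#endif

/* Local mirror of the Rx/Tx descriptor layout, for illustration only. */
struct desc {
    uint64_t addr;
    uint32_t len;
    uint32_t options;
};

/* State that must survive between batches. */
struct rx_state {
    bool mid_packet;  /* true if the last descriptor seen had XDP_PKT_CONTD */
    uint32_t frags;   /* frames seen so far in the current packet */
    uint64_t bytes;   /* bytes seen so far in the current packet */
};

/* Walk one batch of n descriptors and return how many packets were
 * completed. A packet left unfinished stays recorded in *st. */
static size_t consume_batch(struct rx_state *st, const struct desc *d, size_t n)
{
    size_t done = 0;

    for (size_t i = 0; i < n; i++) {
        if (!st->mid_packet) {
            /* This descriptor starts a new packet. */
            st->frags = 0;
            st->bytes = 0;
        }
        st->frags++;
        st->bytes += d[i].len;

        st->mid_packet = (d[i].options & XDP_PKT_CONTD) != 0;
        if (!st->mid_packet)
            done++; /* last frame of the packet */
    }
    return done;
}
```

Zero-initialize the state once per socket and reuse it for every batch read from that socket's Rx ring.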
+An example program each for Rx and Tx multi-buffer support can be found
+later in this document.
+
 Usage
-=====
+-----
 
 In order to use AF_XDP sockets two parts are needed. The
 user-space application and the XDP program. For a complete setup and
@@ -446,15 +577,15 @@ like this:
 .. code-block:: c
 
     // struct xdp_rxtx_ring {
-    // 	__u32 *producer;
-    // 	__u32 *consumer;
-    // 	struct xdp_desc *desc;
+    //     __u32 *producer;
+    //     __u32 *consumer;
+    //     struct xdp_desc *desc;
     // };
 
     // struct xdp_umem_ring {
-    // 	__u32 *producer;
-    // 	__u32 *consumer;
-    // 	__u64 *desc;
+    //     __u32 *producer;
+    //     __u32 *consumer;
+    //     __u64 *desc;
     // };
 
     // typedef struct xdp_rxtx_ring RING;
@@ -495,6 +626,131 @@ like this:
 But please use the libbpf functions as they are optimized and ready to
 use. Will make your life easier.
 
+Usage Multi-Buffer Rx
+---------------------
+
+Here is a simple Rx path pseudo-code example (using libxdp interfaces
+for simplicity). Error paths have been excluded to keep it short:
+
+.. code-block:: c
+
+    void rx_packets(struct xsk_socket_info *xsk)
+    {
+        static bool new_packet = true;
+        u32 idx_rx = 0, idx_fq = 0;
+        static char *pkt;
+
+        int rcvd = xsk_ring_cons__peek(&xsk->rx, opt_batch_size, &idx_rx);
+
+        xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+
+        for (int i = 0; i < rcvd; i++) {
+            struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
+            char *frag = xsk_umem__get_data(xsk->umem->buffer, desc->addr);
+            bool eop = !(desc->options & XDP_PKT_CONTD);
+
+            if (new_packet)
+                pkt = frag;
+            else
+                add_frag_to_pkt(pkt, frag);
+
+            if (eop)
+                process_pkt(pkt);
+
+            new_packet = eop;
+
+            *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
+        }
+
+        xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
+        xsk_ring_cons__release(&xsk->rx, rcvd);
+    }
+
+Usage Multi-Buffer Tx
+---------------------
+
+Here is an example Tx path pseudo-code (using libxdp interfaces for
+simplicity) ignoring that the umem is finite in size, and that we
+eventually will run out of packets to send. It also assumes that
+pkts.addr points to a valid location in the umem.
+
+.. code-block:: c
+
+    void tx_packets(struct xsk_socket_info *xsk, struct pkt *pkts,
+                    int batch_size)
+    {
+        u32 idx, i, pkt_nb = 0;
+
+        xsk_ring_prod__reserve(&xsk->tx, batch_size, &idx);
+
+        for (i = 0; i < batch_size;) {
+            u64 addr = pkts[pkt_nb].addr;
+            u32 len = pkts[pkt_nb].size;
+
+            do {
+                struct xdp_desc *tx_desc;
+
+                tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i++);
+                tx_desc->addr = addr;
+
+                if (len > xsk_frame_size) {
+                    tx_desc->len = xsk_frame_size;
+                    tx_desc->options = XDP_PKT_CONTD;
+                } else {
+                    tx_desc->len = len;
+                    tx_desc->options = 0;
+                    pkt_nb++;
+                }
+                len -= tx_desc->len;
+                addr += xsk_frame_size;
+
+                if (i == batch_size) {
+                    /* Remember len, addr, pkt_nb for next iteration.
+                     * Skipped for simplicity.
+                     */
+                    break;
+                }
+            } while (len);
+        }
+
+        xsk_ring_prod__submit(&xsk->tx, i);
+    }
+
+Probing for Multi-Buffer Support
+--------------------------------
+
+To discover if a driver supports multi-buffer AF_XDP in SKB or DRV
+mode, use the XDP_FEATURES feature of netlink in linux/netdev.h to
+query for NETDEV_XDP_ACT_RX_SG support. This is the same flag as for
+querying for XDP multi-buffer support. If XDP supports multi-buffer in
+a driver, then AF_XDP will also support that in SKB and DRV mode.
+
+To discover if a driver supports multi-buffer AF_XDP in zero-copy
+mode, use XDP_FEATURES and first check the NETDEV_XDP_ACT_XSK_ZEROCOPY
+flag. If it is set, it means that at least zero-copy is supported and
+you should go and check the netlink attribute
+NETDEV_A_DEV_XDP_ZC_MAX_SEGS in linux/netdev.h. An unsigned integer
+value will be returned stating the max number of frags that are
+supported by this device in zero-copy mode. These are the possible
+return values:
+
+1: Multi-buffer for zero-copy is not supported by this device, as max
+   one fragment supported means that multi-buffer is not possible.
+
+>=2: Multi-buffer is supported in zero-copy mode for this device. The
+     returned number signifies the max number of frags supported.
+
+For an example of how these are used through libbpf, please take a
+look at tools/testing/selftests/bpf/xskxceiver.c.
+
+Multi-Buffer Support for Zero-Copy Drivers
+------------------------------------------
+
+Zero-copy drivers usually use the batched APIs for Rx and Tx
+processing. Note that the Tx batch API guarantees that it will provide
+a batch of Tx descriptors that ends with a full packet. This is
+to facilitate extending a zero-copy driver with multi-buffer support.
+
 Sample application
 ==================
 
@@ -555,7 +811,7 @@ A: When a netdev of a physical NIC is initialized, Linux usually
    A number of other ways are possible all up to the capabilities of
    the NIC you have.
 
-Q: Can I use the XSKMAP to implement a switch betwen different umems
+Q: Can I use the XSKMAP to implement a switch between different umems
    in copy mode?
 
 A: The short answer is no, that is not supported at the moment. The
@@ -567,6 +823,21 @@ A: The short answer is no, that is not supported at the moment. The
    switch, or other distribution mechanism, in your NIC to direct
    traffic to the correct queue id and socket.
 
+   Note that if you are using the XDP_SHARED_UMEM option, it is
+   possible to switch traffic between any socket bound to the same
+   umem.
+
+Q: My packets are sometimes corrupted. What is wrong?
+
+A: Care has to be taken not to feed the same buffer in the UMEM into
+   more than one ring at the same time. If you for example feed the
+   same buffer into the FILL ring and the TX ring at the same time, the
+   NIC might receive data into the buffer at the same time it is
+   sending it. This will cause some packets to become corrupted. Same
+   thing goes for feeding the same buffer into the FILL rings
+   belonging to different queue ids or netdevs bound with the
+   XDP_SHARED_UMEM flag.
+
 Credits
 =======