summaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/af_xdp.rst48
-rw-r--r--Documentation/networking/arcnet-hardware.rst2
-rw-r--r--Documentation/networking/bonding.rst11
-rw-r--r--Documentation/networking/can.rst11
-rw-r--r--Documentation/networking/dccp.rst219
-rw-r--r--Documentation/networking/device_drivers/ethernet/amazon/ena.rst108
-rw-r--r--Documentation/networking/device_drivers/ethernet/huawei/hinic3.rst137
-rw-r--r--Documentation/networking/device_drivers/ethernet/index.rst4
-rw-r--r--Documentation/networking/device_drivers/ethernet/intel/ice.rst13
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst2
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst32
-rw-r--r--Documentation/networking/device_drivers/ethernet/meta/fbnic.rst90
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/cpsw.rst6
-rw-r--r--Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst56
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/ngbevf.rst16
-rw-r--r--Documentation/networking/device_drivers/ethernet/wangxun/txgbevf.rst16
-rw-r--r--Documentation/networking/devlink/devlink-info.rst4
-rw-r--r--Documentation/networking/devlink/devlink-params.rst6
-rw-r--r--Documentation/networking/devlink/devlink-port.rst8
-rw-r--r--Documentation/networking/devlink/devlink-trap.rst2
-rw-r--r--Documentation/networking/devlink/index.rst4
-rw-r--r--Documentation/networking/devlink/ixgbe.rst171
-rw-r--r--Documentation/networking/devlink/kvaser_pciefd.rst24
-rw-r--r--Documentation/networking/devlink/kvaser_usb.rst33
-rw-r--r--Documentation/networking/devlink/netdevsim.rst2
-rw-r--r--Documentation/networking/devlink/zl3073x.rst51
-rw-r--r--Documentation/networking/devmem.rst150
-rw-r--r--Documentation/networking/ethtool-netlink.rst131
-rw-r--r--Documentation/networking/index.rst1
-rw-r--r--Documentation/networking/ip-sysctl.rst776
-rw-r--r--Documentation/networking/napi.rst9
-rw-r--r--Documentation/networking/net_cachelines/net_device.rst5
-rw-r--r--Documentation/networking/net_cachelines/snmp.rst3
-rw-r--r--Documentation/networking/net_cachelines/tcp_sock.rst2
-rw-r--r--Documentation/networking/netconsole.rst32
-rw-r--r--Documentation/networking/netdev-features.rst5
-rw-r--r--Documentation/networking/netdevices.rst96
-rw-r--r--Documentation/networking/netmem.rst23
-rw-r--r--Documentation/networking/nf_conntrack-sysctl.rst1
-rw-r--r--Documentation/networking/phy.rst7
-rw-r--r--Documentation/networking/rds.rst8
-rw-r--r--Documentation/networking/rxrpc.rst39
-rw-r--r--Documentation/networking/timestamping.rst8
-rw-r--r--Documentation/networking/tls.rst4
-rw-r--r--Documentation/networking/tproxy.rst4
-rw-r--r--Documentation/networking/xdp-rx-metadata.rst33
-rw-r--r--Documentation/networking/xfrm_device.rst10
47 files changed, 1892 insertions, 531 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index dceeb0d763aa..50d92084a49c 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -209,13 +209,10 @@ Libbpf
Libbpf is a helper library for eBPF and XDP that makes using these
technologies a lot simpler. It also contains specific helper functions
-in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It
-contains two types of functions: those that can be used to make the
-setup of AF_XDP socket easier and ones that can be used in the data
-plane to access the rings safely and quickly. To see an example on how
-to use this API, please take a look at the sample application in
-samples/bpf/xdpsock_usr.c which uses libbpf for both setup and data
-plane operations.
+in tools/testing/selftests/bpf/xsk.h for facilitating the use of
+AF_XDP. It contains two types of functions: those that can be used to
+make the setup of AF_XDP socket easier and ones that can be used in the
+data plane to access the rings safely and quickly.
We recommend that you use this library unless you have become a power
user. It will make your program a lot simpler.
@@ -372,9 +369,8 @@ needs to explicitly notify the kernel to send any packets put on the
TX ring. This can be accomplished either by a poll() call, as in the
RX path, or by calling sendto().
-An example of how to use this flag can be found in
-samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers
-would look like this for the TX path:
+An example with the use of libbpf helpers would look like this for the
+TX path:
.. code-block:: c
@@ -442,6 +438,15 @@ is created by a privileged process and passed to a non-privileged one.
Once the option is set, kernel will refuse attempts to bind that socket
to a different interface. Updating the value requires CAP_NET_RAW.
+XDP_MAX_TX_SKB_BUDGET setsockopt
+--------------------------------
+
+This setsockopt sets the maximum number of descriptors that can be handled
+and passed to the driver at one send syscall. It is applied in the copy
+mode to allow application to tune the per-socket maximum iteration for
+better throughput and less frequency of send syscall.
+Allowed range is [32, xs->tx->nentries].
+
XDP_STATISTICS getsockopt
-------------------------
@@ -549,12 +554,12 @@ later in this document.
Usage
-----
-In order to use AF_XDP sockets two parts are needed. The
-user-space application and the XDP program. For a complete setup and
-usage example, please refer to the sample application. The user-space
-side is xdpsock_user.c and the XDP side is part of libbpf.
+In order to use AF_XDP sockets two parts are needed. The user-space
+application and the XDP program. For a complete setup and usage example,
+please refer to the xdp-project at
+https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example.
-The XDP code sample included in tools/lib/bpf/xsk.c is the following:
+The XDP code sample is the following:
.. code-block:: c
@@ -752,11 +757,12 @@ to facilitate extending a zero-copy driver with multi-buffer support.
Sample application
==================
-
-There is a xdpsock benchmarking/test application included that
-demonstrates how to use AF_XDP sockets with private UMEMs. Say that
-you would like your UDP traffic from port 4242 to end up in queue 16,
-that we will enable AF_XDP on. Here, we use ethtool for this::
+There is a xdpsock benchmarking/test application that can be found at
+https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example
+that demonstrates how to use AF_XDP sockets with private
+UMEMs. Say that you would like your UDP traffic from port 4242 to end
+up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
+for this::
ethtool -N p3p2 rx-flow-hash udp4 fn
ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
@@ -773,7 +779,7 @@ can be displayed with "-h", as usual.
This sample application uses libbpf to make the setup and usage of
AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is
really used to make something more advanced, take a look at the libbpf
-code in tools/lib/bpf/xsk.[ch].
+code in tools/testing/selftests/bpf/xsk.[ch].
FAQ
=======
diff --git a/Documentation/networking/arcnet-hardware.rst b/Documentation/networking/arcnet-hardware.rst
index 982215723582..3bf7f99cd7bb 100644
--- a/Documentation/networking/arcnet-hardware.rst
+++ b/Documentation/networking/arcnet-hardware.rst
@@ -3152,7 +3152,7 @@ Tiara
(model unknown)
---------------
- - from Christoph Lameter <christoph@lameter.com>
+ - from Christoph Lameter <cl@gentwo.org>
Here is information about my card as far as I could figure it out::
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index a4c1291d2561..f8f5766703d4 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -562,6 +562,12 @@ lacp_rate
The default is slow.
+broadcast_neighbor
+
+ Option specifying whether to broadcast ARP/ND packets to all
+ active slaves. This option has no effect in modes other than
+ 802.3ad mode. The default is off (0).
+
max_bonds
Specifies the number of bonding devices to create for this
@@ -767,8 +773,9 @@ num_unsol_na
greater than 1.
The valid range is 0 - 255; the default value is 1. These options
- affect only the active-backup mode. These options were added for
- bonding versions 3.3.0 and 3.4.0 respectively.
+ affect the active-backup or 802.3ad (broadcast_neighbor enabled) mode.
+ These options were added for bonding versions 3.3.0 and 3.4.0
+ respectively.
From Linux 3.0 and bonding version 3.7.1, these notifications
are generated by the ipv4 and ipv6 code and the numbers of
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index b018ce346392..bc1b585355f7 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -1104,15 +1104,12 @@ for writing CAN network device driver are described below:
General Settings
----------------
-.. code-block:: C
-
- dev->type = ARPHRD_CAN; /* the netdevice hardware type */
- dev->flags = IFF_NOARP; /* CAN has no arp */
+CAN network device drivers can use alloc_candev_mqs() and friends instead of
+alloc_netdev_mqs(), to automatically take care of CAN-specific setup:
- dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> Classical CAN interface */
+.. code-block:: C
- or alternative, when the controller supports CAN with flexible data rate:
- dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */
+ dev = alloc_candev_mqs(...);
The struct can_frame or struct canfd_frame is the payload of each socket
buffer (skbuff) in the protocol family PF_CAN.
diff --git a/Documentation/networking/dccp.rst b/Documentation/networking/dccp.rst
deleted file mode 100644
index 91e5c33ba3ff..000000000000
--- a/Documentation/networking/dccp.rst
+++ /dev/null
@@ -1,219 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-=============
-DCCP protocol
-=============
-
-
-.. Contents
- - Introduction
- - Missing features
- - Socket options
- - Sysctl variables
- - IOCTLs
- - Other tunables
- - Notes
-
-
-Introduction
-============
-Datagram Congestion Control Protocol (DCCP) is an unreliable, connection
-oriented protocol designed to solve issues present in UDP and TCP, particularly
-for real-time and multimedia (streaming) traffic.
-It divides into a base protocol (RFC 4340) and pluggable congestion control
-modules called CCIDs. Like pluggable TCP congestion control, at least one CCID
-needs to be enabled in order for the protocol to function properly. In the Linux
-implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as
-the TCP-friendly CCID3 (RFC 4342), are optional.
-For a brief introduction to CCIDs and suggestions for choosing a CCID to match
-given applications, see section 10 of RFC 4340.
-
-It has a base protocol and pluggable congestion control IDs (CCIDs).
-
-DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol
-is at http://www.ietf.org/html.charters/dccp-charter.html
-
-
-Missing features
-================
-The Linux DCCP implementation does not currently support all the features that are
-specified in RFCs 4340...42.
-
-The known bugs are at:
-
- http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP
-
-For more up-to-date versions of the DCCP implementation, please consider using
-the experimental DCCP test tree; instructions for checking this out are on:
-http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree
-
-
-Socket options
-==============
-DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes
-a policy ID as argument and can only be set before the connection (i.e. changes
-during an established connection are not supported). Currently, two policies are
-defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special,
-and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an
-u32 priority value as ancillary data to sendmsg(), where higher numbers indicate
-a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to
-be formatted using a cmsg(3) message header filled in as follows::
-
- cmsg->cmsg_level = SOL_DCCP;
- cmsg->cmsg_type = DCCP_SCM_PRIORITY;
- cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */
-
-DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero
-value is always interpreted as unbounded queue length. If different from zero,
-the interpretation of this parameter depends on the current dequeuing policy
-(see above): the "simple" policy will enforce a fixed queue size by returning
-EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the
-lowest-priority packet first. The default value for this parameter is
-initialised from /proc/sys/net/dccp/default/tx_qlen.
-
-DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of
-service codes (RFC 4340, sec. 8.1.2); if this socket option is not set,
-the socket will fall back to 0 (which means that no meaningful service code
-is present). On active sockets this is set before connect(); specifying more
-than one code has no effect (all subsequent service codes are ignored). The
-case is different for passive sockets, where multiple service codes (up to 32)
-can be set before calling bind().
-
-DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet
-size (application payload size) in bytes, see RFC 4340, section 14.
-
-DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs
-supported by the endpoint. The option value is an array of type uint8_t whose
-size is passed as option length. The minimum array size is 4 elements, the
-value returned in the optlen argument always reflects the true number of
-built-in CCIDs.
-
-DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same
-time, combining the operation of the next two socket options. This option is
-preferable over the latter two, since often applications will use the same
-type of CCID for both directions; and mixed use of CCIDs is not currently well
-understood. This socket option takes as argument at least one uint8_t value, or
-an array of uint8_t values, which must match available CCIDS (see above). CCIDs
-must be registered on the socket before calling connect() or listen().
-
-DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets
-the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID.
-Please note that the getsockopt argument type here is ``int``, not uint8_t.
-
-DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID.
-
-DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold
-timewait state when closing the connection (RFC 4340, 8.3). The usual case is
-that the closing server sends a CloseReq, whereupon the client holds timewait
-state. When this boolean socket option is on, the server sends a Close instead
-and will enter TIMEWAIT. This option must be set after accept() returns.
-
-DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the
-partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums
-always cover the entire packet and that only fully covered application data is
-accepted by the receiver. Hence, when using this feature on the sender, it must
-be enabled at the receiver, too with suitable choice of CsCov.
-
-DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the
- range 0..15 are acceptable. The default setting is 0 (full coverage),
- values between 1..15 indicate partial coverage.
-
-DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it
- sets a threshold, where again values 0..15 are acceptable. The default
- of 0 means that all packets with a partial coverage will be discarded.
- Values in the range 1..15 indicate that packets with minimally such a
- coverage value are also acceptable. The higher the number, the more
- restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage
- settings are inherited to the child socket after accept().
-
-The following two options apply to CCID 3 exclusively and are getsockopt()-only.
-In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned.
-
-DCCP_SOCKOPT_CCID_RX_INFO
- Returns a ``struct tfrc_rx_info`` in optval; the buffer for optval and
- optlen must be set to at least sizeof(struct tfrc_rx_info).
-
-DCCP_SOCKOPT_CCID_TX_INFO
- Returns a ``struct tfrc_tx_info`` in optval; the buffer for optval and
- optlen must be set to at least sizeof(struct tfrc_tx_info).
-
-On unidirectional connections it is useful to close the unused half-connection
-via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs.
-
-
-Sysctl variables
-================
-Several DCCP default parameters can be managed by the following sysctls
-(sysctl net.dccp.default or /proc/sys/net/dccp/default):
-
-request_retries
- The number of active connection initiation retries (the number of
- Requests minus one) before timing out. In addition, it also governs
- the behaviour of the other, passive side: this variable also sets
- the number of times DCCP repeats sending a Response when the initial
- handshake does not progress from RESPOND to OPEN (i.e. when no Ack
- is received after the initial Request). This value should be greater
- than 0, suggested is less than 10. Analogue of tcp_syn_retries.
-
-retries1
- How often a DCCP Response is retransmitted until the listening DCCP
- side considers its connecting peer dead. Analogue of tcp_retries1.
-
-retries2
- The number of times a general DCCP packet is retransmitted. This has
- importance for retransmitted acknowledgments and feature negotiation,
- data packets are never retransmitted. Analogue of tcp_retries2.
-
-tx_ccid = 2
- Default CCID for the sender-receiver half-connection. Depending on the
- choice of CCID, the Send Ack Vector feature is enabled automatically.
-
-rx_ccid = 2
- Default CCID for the receiver-sender half-connection; see tx_ccid.
-
-seq_window = 100
- The initial sequence window (sec. 7.5.2) of the sender. This influences
- the local ackno validity and the remote seqno validity windows (7.5.1).
- Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set.
-
-tx_qlen = 5
- The size of the transmit buffer in packets. A value of 0 corresponds
- to an unbounded transmit buffer.
-
-sync_ratelimit = 125 ms
- The timeout between subsequent DCCP-Sync packets sent in response to
- sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit
- of this parameter is milliseconds; a value of 0 disables rate-limiting.
-
-
-IOCTLS
-======
-FIONREAD
- Works as in udp(7): returns in the ``int`` argument pointer the size of
- the next pending datagram in bytes, or 0 when no datagram is pending.
-
-SIOCOUTQ
- Returns the number of unsent data bytes in the socket send queue as ``int``
- into the buffer specified by the argument pointer.
-
-Other tunables
-==============
-Per-route rto_min support
- CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value
- of the RTO timer. This setting can be modified via the 'rto_min' option
- of iproute2; for example::
-
- > ip route change 10.0.0.0/24 rto_min 250j dev wlan0
- > ip route add 10.0.0.254/32 rto_min 800j dev wlan0
- > ip route show dev wlan0
-
- CCID-3 also supports the rto_min setting: it is used to define the lower
- bound for the expiry of the nofeedback timer. This can be useful on LANs
- with very low RTTs (e.g., loopback, Gbit ethernet).
-
-
-Notes
-=====
-DCCP does not travel through NAT successfully at present on many boxes. This is
-because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT
-support for DCCP has been added.
diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index 4561e8ab9e08..14784a0a6a8a 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -56,6 +56,9 @@ ena_netdev.[ch] Main Linux kernel driver.
ena_ethtool.c ethtool callbacks.
ena_xdp.[ch] XDP files
ena_pci_id_tbl.h Supported device IDs.
+ena_phc.[ch] PTP hardware clock infrastructure (see `PHC`_ for more info)
+ena_devlink.[ch] devlink files.
+ena_debugfs.[ch] debugfs files.
================= ======================================================
Management Interface:
@@ -221,6 +224,99 @@ descriptor it was received on would be recycled. When a packet smaller
than RX copybreak bytes is received, it is copied into a new memory
buffer and the RX descriptor is returned to HW.
+.. _`PHC`:
+
+PTP Hardware Clock (PHC)
+========================
+.. _`ptp-userspace-api`: https://docs.kernel.org/driver-api/ptp.html#ptp-hardware-clock-user-space-api
+.. _`testptp`: https://elixir.bootlin.com/linux/latest/source/tools/testing/selftests/ptp/testptp.c
+
+ENA Linux driver supports PTP hardware clock providing timestamp reference to achieve nanosecond resolution.
+
+**PHC support**
+
+PHC depends on the PTP module, which needs to be either loaded as a module or compiled into the kernel.
+
+Verify if the PTP module is present:
+
+.. code-block:: shell
+
+ grep -w '^CONFIG_PTP_1588_CLOCK=[ym]' /boot/config-`uname -r`
+
+- If no output is provided, the ENA driver cannot be loaded with PHC support.
+
+**PHC activation**
+
+The feature is turned off by default, in order to turn the feature on, the ENA driver
+can be loaded in the following way:
+
+- devlink:
+
+.. code-block:: shell
+
+ sudo devlink dev param set pci/<domain:bus:slot.function> name enable_phc value true cmode driverinit
+ sudo devlink dev reload pci/<domain:bus:slot.function>
+ # for example:
+ sudo devlink dev param set pci/0000:00:06.0 name enable_phc value true cmode driverinit
+ sudo devlink dev reload pci/0000:00:06.0
+
+All available PTP clock sources can be tracked here:
+
+.. code-block:: shell
+
+ ls /sys/class/ptp
+
+PHC support and capabilities can be verified using ethtool:
+
+.. code-block:: shell
+
+ ethtool -T <interface>
+
+**PHC timestamp**
+
+To retrieve PHC timestamp, use `ptp-userspace-api`_, usage example using `testptp`_:
+
+.. code-block:: shell
+
+ testptp -d /dev/ptp$(ethtool -T <interface> | awk '/PTP Hardware Clock:/ {print $NF}') -k 1
+
+PHC get time requests should be within reasonable bounds,
+avoid excessive utilization to ensure optimal performance and efficiency.
+The ENA device restricts the frequency of PHC get time requests to a maximum
+of 125 requests per second. If this limit is surpassed, the get time request
+will fail, leading to an increment in the phc_err_ts statistic.
+
+**PHC statistics**
+
+PHC can be monitored using debugfs (if mounted):
+
+.. code-block:: shell
+
+ sudo cat /sys/kernel/debug/<domain:bus:slot.function>/phc_stats
+
+ # for example:
+ sudo cat /sys/kernel/debug/0000:00:06.0/phc_stats
+
+PHC errors must remain below 1% of all PHC requests to maintain the desired level of accuracy and reliability
+
+================= ======================================================
+**phc_cnt** | Number of successful retrieved timestamps (below expire timeout).
+**phc_exp** | Number of expired retrieved timestamps (above expire timeout).
+**phc_skp** | Number of skipped get time attempts (during block period).
+**phc_err_dv** | Number of failed get time attempts due to device errors (entering into block state).
+**phc_err_ts** | Number of failed get time attempts due to timestamp errors (entering into block state),
+ | This occurs if driver exceeded the request limit or device received an invalid timestamp.
+================= ======================================================
+
+PHC timeouts:
+
+================= ======================================================
+**expire** | Max time for a valid timestamp retrieval, passing this threshold will fail
+ | the get time request and block new requests until block timeout.
+**block** | Blocking period starts once get time request expires or fails,
+ | all get time requests during block period will be skipped.
+================= ======================================================
+
Statistics
==========
@@ -268,6 +364,18 @@ RSS
- The user can provide a hash key, hash function, and configure the
indirection table through `ethtool(8)`.
+DEVLINK SUPPORT
+===============
+.. _`devlink`: https://www.kernel.org/doc/html/latest/networking/devlink/index.html
+
+`devlink`_ supports reloading the driver and initiating re-negotiation with the ENA device
+
+.. code-block:: shell
+
+ sudo devlink dev reload pci/<domain:bus:slot.function>
+ # for example:
+ sudo devlink dev reload pci/0000:00:06.0
+
DATA PATH
=========
diff --git a/Documentation/networking/device_drivers/ethernet/huawei/hinic3.rst b/Documentation/networking/device_drivers/ethernet/huawei/hinic3.rst
new file mode 100644
index 000000000000..e3dfd083fa52
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/huawei/hinic3.rst
@@ -0,0 +1,137 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================================
+Linux kernel driver for Huawei Ethernet Device Driver (hinic3) family
+=====================================================================
+
+Overview
+========
+
+The hinic3 is a network interface card (NIC) for Data Center. It supports
+a range of link-speed devices (10GE, 25GE, 100GE, etc.). The hinic3
+devices can have multiple physical forms: LOM (Lan on Motherboard) NIC,
+PCIe standard NIC, OCP (Open Compute Project) NIC, etc.
+
+The hinic3 driver supports the following features:
+- IPv4/IPv6 TCP/UDP checksum offload
+- TSO (TCP Segmentation Offload), LRO (Large Receive Offload)
+- RSS (Receive Side Scaling)
+- MSI-X interrupt aggregation configuration and interrupt adaptation.
+- SR-IOV (Single Root I/O Virtualization).
+
+Content
+=======
+
+- Supported PCI vendor ID/device IDs
+- Source Code Structure of Hinic3 Driver
+- Management Interface
+
+Supported PCI vendor ID/device IDs
+==================================
+
+19e5:0222 - hinic3 PF/PPF
+19e5:375F - hinic3 VF
+
+Prime Physical Function (PPF) is responsible for the management of the
+whole NIC card. For example, clock synchronization between the NIC and
+the host. Any PF may serve as a PPF. The PPF is selected dynamically.
+
+Source Code Structure of Hinic3 Driver
+======================================
+
+======================== ================================================
+hinic3_pci_id_tbl.h Supported device IDs
+hinic3_hw_intf.h Interface between HW and driver
+hinic3_queue_common.[ch] Common structures and methods for NIC queues
+hinic3_common.[ch] Encapsulation of memory operations in Linux
+hinic3_csr.h Register definitions in the BAR
+hinic3_hwif.[ch] Interface for BAR
+hinic3_eqs.[ch] Interface for AEQs and CEQs
+hinic3_mbox.[ch] Interface for mailbox
+hinic3_mgmt.[ch] Management interface based on mailbox and AEQ
+hinic3_wq.[ch] Work queue data structures and interface
+hinic3_cmdq.[ch] Command queue is used to post command to HW
+hinic3_hwdev.[ch] HW structures and methods abstractions
+hinic3_lld.[ch] Auxiliary driver adaptation layer
+hinic3_hw_comm.[ch] Interface for common HW operations
+hinic3_mgmt_interface.h Interface between firmware and driver
+hinic3_hw_cfg.[ch] Interface for HW configuration
+hinic3_irq.c Interrupt request
+hinic3_netdev_ops.c Operations registered to Linux kernel stack
+hinic3_nic_dev.h NIC structures and methods abstractions
+hinic3_main.c Main Linux kernel driver
+hinic3_nic_cfg.[ch] NIC service configuration
+hinic3_nic_io.[ch] Management plane interface for TX and RX
+hinic3_rss.[ch] Interface for Receive Side Scaling (RSS)
+hinic3_rx.[ch] Interface for transmit
+hinic3_tx.[ch] Interface for receive
+hinic3_ethtool.c Interface for ethtool operations (ops)
+hinic3_filter.c Interface for MAC address
+======================== ================================================
+
+Management Interface
+====================
+
+Asynchronous Event Queue (AEQ)
+------------------------------
+
+AEQ receives high priority events from the HW over a descriptor queue.
+Every descriptor is a fixed size of 64 bytes. AEQ can receive solicited or
+unsolicited events. Every device, VF or PF, can have up to 4 AEQs.
+Every AEQ is associated to a dedicated IRQ. AEQ can receive multiple types
+of events, but in practice the hinic3 driver ignores all events except for
+2 mailbox related events.
+
+Mailbox
+-------
+
+Mailbox is a communication mechanism between the hinic3 driver and the HW.
+Each device has an independent mailbox. Driver can use the mailbox to send
+requests to management. Driver receives mailbox messages, such as responses
+to requests, over the AEQ (using event HINIC3_AEQ_FOR_MBOX). Due to the
+limited size of mailbox data register, mailbox messages are sent
+segment-by-segment.
+
+Every device can use its mailbox to post request to firmware. The mailbox
+can also be used to post requests and responses between the PF and its VFs.
+
+Completion Event Queue (CEQ)
+----------------------------
+
+The implementation of CEQ is the same as AEQ. It receives completion events
+from HW over a fixed size descriptor of 32 bits. Every device can have up
+to 32 CEQs. Every CEQ has a dedicated IRQ. CEQ only receives solicited
+events that are responses to requests from the driver. CEQ can receive
+multiple types of events, but in practice the hinic3 driver ignores all
+events except for HINIC3_CMDQ that represents completion of previously
+posted commands on a cmdq.
+
+Command Queue (cmdq)
+--------------------
+
+Every cmdq has a dedicated work queue on which commands are posted.
+Commands on the work queue are fixed size descriptor of size 64 bytes.
+Completion of a command will be indicated using ctrl bits in the
+descriptor that carried the command. Notification of command completions
+will also be provided via event on CEQ. Every device has 4 command queues
+that are initialized as a set (called cmdqs), each with its own type.
+Hinic3 driver only uses type HINIC3_CMDQ_SYNC.
+
+Work Queues(WQ)
+---------------
+
+Work queues are logical arrays of fixed size WQEs. The array may be spread
+over multiple non-contiguous pages using indirection table. Work queues are
+used by I/O queues and command queues.
+
+Global function ID
+------------------
+
+Every function, PF or VF, has a unique ordinal identification within the device.
+Many management commands (mbox or cmdq) contain this ID so HW can apply the
+command effect to the right function.
+
+PF is allowed to post management commands to a subordinate VF by specifying the
+VFs ID. A VF must provide its own ID. Anti-spoofing in the HW will cause
+command from a VF to fail if it contains the wrong ID.
+
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index 05d822b904b4..40ac552641a3 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -28,6 +28,7 @@ Contents:
freescale/gianfar
google/gve
huawei/hinic
+ huawei/hinic3
intel/e100
intel/e1000
intel/e1000e
@@ -55,8 +56,11 @@ Contents:
ti/cpsw_switchdev
ti/am65_nuss_cpsw_switchdev
ti/tlan
+ ti/icssg_prueth
wangxun/txgbe
+ wangxun/txgbevf
wangxun/ngbe
+ wangxun/ngbevf
.. only:: subproject and html
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
index 3c46a48d99ba..0bca293cf9cb 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -927,6 +927,19 @@ To enable/disable UDP Segmentation Offload, issue the following command::
# ethtool -K <ethX> tx-udp-segmentation [off|on]
+PTP pin interface
+-----------------
+All adapters support standard PTP pin interface. SDPs (Software Definable Pin)
+are single ended pins with both periodic output and external timestamp
+supported. There are also specific differential input/output pins (TIME_SYNC,
+1PPS) with only one of the functions supported.
+
+There are adapters with DPLL, where pins are connected to the DPLL instead of
+being exposed on the board. You have to be aware that in those configurations,
+only SDP pins are exposed and each pin has its own fixed direction.
+To see input signal on those PTP pins, you need to configure DPLL properly.
+Output signal is only visible on DPLL and to send it to the board SMA/U.FL pins,
+DPLL output pins have to be manually configured.
GNSS module
-----------
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
index af7db0e91f6b..a52850602cd8 100644
--- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -66,7 +66,7 @@ Admin Function driver
As mentioned above RVU PF0 is called the admin function (AF), this driver
supports resource provisioning and configuration of functional blocks.
Doesn't handle any I/O. It sets up few basic stuff but most of the
-funcionality is achieved via configuration requests from PFs and VFs.
+functionality is achieved via configuration requests from PFs and VFs.
PF/VFs communicates with AF via a shared memory region (mailbox). Upon
receiving requests AF does resource provisioning and other HW configuration.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
index 43d72c8b713b..754c81436408 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -1341,3 +1341,35 @@ Device Counters
- The number of times the device owned queue had not enough buffers
allocated.
- Error
+
+ * - `pci_bw_inbound_high`
+ - The number of times the device crossed the high inbound pcie bandwidth
+ threshold. To be compared to pci_bw_inbound_low to check if the device
+ is in a congested state.
+ If pci_bw_inbound_high == pci_bw_inbound_low then the device is not congested.
+ If pci_bw_inbound_high > pci_bw_inbound_low then the device is congested.
+ - Tnformative
+
+ * - `pci_bw_inbound_low`
+ - The number of times the device crossed the low inbound PCIe bandwidth
+ threshold. To be compared to pci_bw_inbound_high to check if the device
+ is in a congested state.
+ If pci_bw_inbound_high == pci_bw_inbound_low then the device is not congested.
+ If pci_bw_inbound_high > pci_bw_inbound_low then the device is congested.
+ - Informative
+
+ * - `pci_bw_outbound_high`
+ - The number of times the device crossed the high outbound pcie bandwidth
+ threshold. To be compared to pci_bw_outbound_low to check if the device
+ is in a congested state.
+ If pci_bw_outbound_high == pci_bw_outbound_low then the device is not congested.
+ If pci_bw_outbound_high > pci_bw_outbound_low then the device is congested.
+ - Informative
+
+ * - `pci_bw_outbound_low`
+ - The number of times the device crossed the low outbound PCIe bandwidth
+ threshold. To be compared to pci_bw_outbound_high to check if the device
+ is in a congested state.
+ If pci_bw_outbound_high == pci_bw_outbound_low then the device is not congested.
+ If pci_bw_outbound_high > pci_bw_outbound_low then the device is congested.
+ - Informative
diff --git a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
index 04e0595bb0a7..afb8353daefd 100644
--- a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
+++ b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
@@ -28,9 +28,90 @@ devlink dev info provides version information for all three components. In
addition to the version the hg commit hash of the build is included as a
separate entry.
+Configuration
+-------------
+
+Ringparams (ethtool -g / -G)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+fbnic has two submission (host -> device) rings for every completion
+(device -> host) ring. The three ring objects together form a single
+"queue" as used by higher layer software (a Rx, or a Tx queue).
+
+For Rx the two submission rings are used to pass empty pages to the NIC.
+Ring 0 is the Header Page Queue (HPQ), NIC will use its pages to place
+L2-L4 headers (or full frames if frame is not header-data split).
+Ring 1 is the Payload Page Queue (PPQ) and used for packet payloads.
+The completion ring is used to receive packet notifications / metadata.
+ethtool ``rx`` ringparam maps to the size of the completion ring,
+``rx-mini`` to the HPQ, and ``rx-jumbo`` to the PPQ.
+
+For Tx both submission rings can be used to submit packets, the completion
+ring carries notifications for both. fbnic uses one of the submission
+rings for normal traffic from the stack and the second one for XDP frames.
+ethtool ``tx`` ringparam controls both the size of the submission rings
+and the completion ring.
+
+Every single entry on the HPQ and PPQ (``rx-mini``, ``rx-jumbo``)
+corresponds to 4kB of allocated memory, while entries on the remaining
+rings are in units of descriptors (8B). The ideal ratio of submission
+and completion ring sizes will depend on the workload, as for small packets
+multiple packets will fit into a single page.
+
+Upgrading Firmware
+------------------
+
+fbnic supports updating firmware using signed PLDM images with devlink dev
+flash. PLDM images are written into the flash. Flashing does not interrupt
+the operation of the device.
+
+On host boot the latest UEFI driver is always used, no explicit activation
+is required. Firmware activation is required to run new control firmware. cmrt
+firmware can only be activated by power cycling the NIC.
+
Statistics
----------
+TX MAC Interface
+~~~~~~~~~~~~~~~~
+
+ - ``ptp_illegal_req``: packets sent to the NIC with PTP request bit set but routed to BMC/FW
+ - ``ptp_good_ts``: packets successfully routed to MAC with PTP request bit set
+ - ``ptp_bad_ts``: packets destined for MAC with PTP request bit set but aborted because of some error (e.g., DMA read error)
+
+TX Extension (TEI) Interface (TTI)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ - ``tti_cm_drop``: control messages dropped at the TX Extension (TEI) Interface because of credit starvation
+ - ``tti_frame_drop``: packets dropped at the TX Extension (TEI) Interface because of credit starvation
+ - ``tti_tbi_drop``: packets dropped at the TX BMC Interface (TBI) because of credit starvation
+
+RXB (RX Buffer) Enqueue
+~~~~~~~~~~~~~~~~~~~~~~~
+
+ - ``rxb_integrity_err[i]``: frames enqueued with integrity errors (e.g., multi-bit ECC errors) on RXB input i
+ - ``rxb_mac_err[i]``: frames enqueued with MAC end-of-frame errors (e.g., bad FCS) on RXB input i
+ - ``rxb_parser_err[i]``: frames experienced RPC parser errors
+ - ``rxb_frm_err[i]``: frames experienced signaling errors (e.g., missing end-of-packet/start-of-packet) on RXB input i
+ - ``rxb_drbo[i]_frames``: frames received at RXB input i
+ - ``rxb_drbo[i]_bytes``: bytes received at RXB input i
+
+RXB (RX Buffer) FIFO
+~~~~~~~~~~~~~~~~~~~~
+
+ - ``rxb_fifo[i]_drop``: transitions into the drop state on RXB pool i
+ - ``rxb_fifo[i]_dropped_frames``: frames dropped on RXB pool i
+ - ``rxb_fifo[i]_ecn``: transitions into the ECN mark state on RXB pool i
+ - ``rxb_fifo[i]_level``: current occupancy of RXB pool i
+
+RXB (RX Buffer) Dequeue
+~~~~~~~~~~~~~~~~~~~~~~~
+
+ - ``rxb_intf[i]_frames``: frames sent to the output i
+ - ``rxb_intf[i]_bytes``: bytes sent to the output i
+ - ``rxb_pbuf[i]_frames``: frames sent to output i from the perspective of internal packet buffer
+ - ``rxb_pbuf[i]_bytes``: bytes sent to output i from the perspective of internal packet buffer
+
RPC (Rx parser)
~~~~~~~~~~~~~~~
@@ -44,6 +125,15 @@ RPC (Rx parser)
- ``rpc_out_of_hdr_err``: frames where header was larger than parsable region
- ``ovr_size_err``: oversized frames
+Hardware Queues
+~~~~~~~~~~~~~~~
+
+1. RX DMA Engine:
+
+ - ``rde_[i]_pkt_err``: packets with MAC EOP, RPC parser, RXB truncation, or RDE frame truncation errors. These error are flagged in the packet metadata because of cut-through support but the actual drop happens once PCIE/RDE is reached.
+ - ``rde_[i]_pkt_cq_drop``: packets dropped because RCQ is full
+ - ``rde_[i]_pkt_bdq_drop``: packets dropped because HPQ or PPQ ran out of host buffer
+
PCIe
~~~~
diff --git a/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
index a88946bd188b..d3e130455043 100644
--- a/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
+++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw.rst
@@ -268,14 +268,14 @@ Example 1: One port tx AVB configuration scheme for target board
// Run your appropriate tools with socket option "SO_PRIORITY"
// to 3 for class A and/or to 2 for class B
- // (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+ // (I took at https://lore.kernel.org/r/20171017010128.22141-1-vinicius.gomes@intel.com/)
./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p3 -s 1500&
./tsn_talker -d 18:03:73:66:87:42 -i eth0.100 -p2 -s 1500&
13) ::
// run your listener on workstation (should be in same vlan)
- // (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+ // (I took at https://lore.kernel.org/r/20171017010128.22141-1-vinicius.gomes@intel.com/)
./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
Receiving data rate: 39012 kbps
Receiving data rate: 39012 kbps
@@ -555,7 +555,7 @@ Example 2: Two port tx AVB configuration scheme for target board
20) ::
// run your listener on workstation (should be in same vlan)
- // (I took at https://www.spinics.net/lists/netdev/msg460869.html)
+ // (I took at https://lore.kernel.org/r/20171017010128.22141-1-vinicius.gomes@intel.com/)
./tsn_listener -d 18:03:73:66:87:42 -i enp5s0 -s 1500
Receiving data rate: 39012 kbps
Receiving data rate: 39012 kbps
diff --git a/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst b/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst
new file mode 100644
index 000000000000..da21ddf431bb
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/ti/icssg_prueth.rst
@@ -0,0 +1,56 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================================
+Texas Instruments ICSSG PRUETH ethernet driver
+==============================================
+
+:Version: 1.0
+
+ICSSG Firmware
+==============
+
+Every ICSSG core has two Programmable Real-Time Unit(PRUs), two auxiliary
+Real-Time Transfer Unit (RTUs), and two Transmit Real-Time Transfer Units
+(TX_PRUs). Each one of these runs its own firmware. The firmwares combnined are
+referred as ICSSG Firmware.
+
+Firmware Statistics
+===================
+
+The ICSSG firmware maintains certain statistics which are dumped by the driver
+via ``ethtool -S <interface>``
+
+These statistics are as follows,
+
+ - ``FW_RTU_PKT_DROP``: Diagnostic error counter which increments when RTU drops a locally injected packet due to port being disabled or rule violation.
+ - ``FW_Q0_OVERFLOW``: TX overflow counter for queue0
+ - ``FW_Q1_OVERFLOW``: TX overflow counter for queue1
+ - ``FW_Q2_OVERFLOW``: TX overflow counter for queue2
+ - ``FW_Q3_OVERFLOW``: TX overflow counter for queue3
+ - ``FW_Q4_OVERFLOW``: TX overflow counter for queue4
+ - ``FW_Q5_OVERFLOW``: TX overflow counter for queue5
+ - ``FW_Q6_OVERFLOW``: TX overflow counter for queue6
+ - ``FW_Q7_OVERFLOW``: TX overflow counter for queue7
+ - ``FW_DROPPED_PKT``: This counter is incremented when a packet is dropped at PRU because of rule violation.
+ - ``FW_RX_ERROR``: Incremented if there was a CRC error or Min/Max frame error at PRU
+ - ``FW_RX_DS_INVALID``: Incremented when RTU detects Data Status invalid condition
+ - ``FW_TX_DROPPED_PACKET``: Counter for packets dropped via TX Port
+ - ``FW_TX_TS_DROPPED_PACKET``: Counter for packets with TS flag dropped via TX Port
+ - ``FW_INF_PORT_DISABLED``: Incremented when RX frame is dropped due to port being disabled
+ - ``FW_INF_SAV``: Incremented when RX frame is dropped due to Source Address violation
+ - ``FW_INF_SA_DL``: Incremented when RX frame is dropped due to Source Address being in the denylist
+ - ``FW_INF_PORT_BLOCKED``: Incremented when RX frame is dropped due to port being blocked and frame being a special frame
+ - ``FW_INF_DROP_TAGGED`` : Incremented when RX frame is dropped for being tagged
+ - ``FW_INF_DROP_PRIOTAGGED``: Incremented when RX frame is dropped for being priority tagged
+ - ``FW_INF_DROP_NOTAG``: Incremented when RX frame is dropped for being untagged
+ - ``FW_INF_DROP_NOTMEMBER``: Incremented when RX frame is dropped for port not being member of VLAN
+ - ``FW_RX_EOF_SHORT_FRMERR``: Incremented if End Of Frame (EOF) task is scheduled without seeing RX_B1
+ - ``FW_RX_B0_DROP_EARLY_EOF``: Incremented when frame is dropped due to Early EOF
+ - ``FW_TX_JUMBO_FRM_CUTOFF``: Incremented when frame is cut off to prevent packet size > 2000 Bytes
+ - ``FW_RX_EXP_FRAG_Q_DROP``: Incremented when express frame is received in the same queue as the previous fragment
+ - ``FW_RX_FIFO_OVERRUN``: RX fifo overrun counter
+ - ``FW_CUT_THR_PKT``: Incremented when a packet is forwarded using Cut-Through forwarding method
+ - ``FW_HOST_RX_PKT_CNT``: Number of valid packets sent by Rx PRU to Host on PSI
+ - ``FW_HOST_TX_PKT_CNT``: Number of valid packets copied by RTU0 to Tx queues
+ - ``FW_HOST_EGRESS_Q_PRE_OVERFLOW``: Host Egress Q (Pre-emptible) Overflow Counter
+ - ``FW_HOST_EGRESS_Q_EXP_OVERFLOW``: Host Egress Q (Pre-emptible) Overflow Counter
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/ngbevf.rst b/Documentation/networking/device_drivers/ethernet/wangxun/ngbevf.rst
new file mode 100644
index 000000000000..a39e3d5a1038
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/ngbevf.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+==================================================================
+Linux Base Virtual Function Driver for Wangxun(R) Gigabit Ethernet
+==================================================================
+
+WangXun Gigabit Virtual Function Linux driver.
+Copyright(c) 2015 - 2025 Beijing WangXun Technology Co., Ltd.
+
+Support
+=======
+For general information, go to the website at:
+https://www.net-swift.com
+
+If you got any problem, contact Wangxun support team via nic-support@net-swift.com
+and Cc: netdev.
diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/txgbevf.rst b/Documentation/networking/device_drivers/ethernet/wangxun/txgbevf.rst
new file mode 100644
index 000000000000..b2f759b7b518
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/wangxun/txgbevf.rst
@@ -0,0 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+===========================================================================
+Linux Base Virtual Function Driver for Wangxun(R) 10/25/40 Gigabit Ethernet
+===========================================================================
+
+WangXun 10/25/40 Gigabit Virtual Function Linux driver.
+Copyright(c) 2015 - 2025 Beijing WangXun Technology Co., Ltd.
+
+Support
+=======
+For general information, go to the website at:
+https://www.net-swift.com
+
+If you got any problem, contact Wangxun support team via nic-support@net-swift.com
+and Cc: netdev.
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
index 23073bc219d8..dd6adc4d0559 100644
--- a/Documentation/networking/devlink/devlink-info.rst
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -86,6 +86,10 @@ In case software/firmware components are loaded from the disk (e.g.
``/lib/firmware``) only the running version should be reported via
the kernel API.
+Please note that any security versions reported via devlink are purely
+informational. Devlink does not use a secure channel to communicate with
+the device.
+
Generic Versions
================
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index 4e01dc32bc08..211b58177e12 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -137,3 +137,9 @@ own name.
* - ``event_eq_size``
- u32
- Control the size of asynchronous control events EQ.
+ * - ``enable_phc``
+ - Boolean
+ - Enable PHC (PTP Hardware Clock) functionality in the device.
+ * - ``clock_id``
+ - u64
+ - Clock ID used by the device for registering DPLL devices and pins.
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 9d22d41a7cd1..5e397798a402 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -418,6 +418,14 @@ API allows to configure following rate object's parameters:
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
+``tc_bw``
+ Allow users to set the bandwidth allocation per traffic class on rate
+ objects. This enables fine-grained QoS configurations by assigning a relative
+ share value to each traffic class. The bandwidth is distributed in proportion
+ to the share value for each class, relative to the sum of all shares.
+ When applied to a non-leaf node, tc_bw determines how bandwidth is shared
+ among its child elements.
+
``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 2c14dfe69b3a..5885e21e2212 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -451,7 +451,7 @@ be added to the following table:
* - ``udp_parsing``
- ``drop``
- Traps packets dropped due to an error in the UDP header parsing.
- This packet trap could include checksum errorrs, an improper UDP
+ This packet trap could include checksum errors, an improper UDP
length detected (smaller than 8 bytes) or detection of header
truncation.
* - ``tcp_parsing``
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 948c8c44e233..270a65a01411 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -84,6 +84,9 @@ parameters, info versions, and other features it supports.
i40e
ionic
ice
+ ixgbe
+ kvaser_pciefd
+ kvaser_usb
mlx4
mlx5
mlxsw
@@ -97,3 +100,4 @@ parameters, info versions, and other features it supports.
iosm
octeontx2
sfc
+ zl3073x
diff --git a/Documentation/networking/devlink/ixgbe.rst b/Documentation/networking/devlink/ixgbe.rst
new file mode 100644
index 000000000000..c27d1436c70e
--- /dev/null
+++ b/Documentation/networking/devlink/ixgbe.rst
@@ -0,0 +1,171 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+ixgbe devlink support
+=====================
+
+This document describes the devlink features implemented by the ``ixgbe``
+device driver.
+
+Info versions
+=============
+
+Any of the versions dealing with the security presented by ``devlink-info``
+is purely informational. Devlink does not use a secure channel to communicate
+with the device.
+
+The ``ixgbe`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 5 90
+
+ * - Name
+ - Type
+ - Example
+ - Description
+ * - ``board.id``
+ - fixed
+ - H49289-000
+ - The Product Board Assembly (PBA) identifier of the board.
+ * - ``fw.undi``
+ - running
+ - 1.1937.0
+ - Version of the Option ROM containing the UEFI driver. The version is
+ reported in ``major.minor.patch`` format. The major version is
+ incremented whenever a major breaking change occurs, or when the
+ minor version would overflow. The minor version is incremented for
+ non-breaking changes and reset to 1 when the major version is
+ incremented. The patch version is normally 0 but is incremented when
+ a fix is delivered as a patch against an older base Option ROM.
+ * - ``fw.undi.srev``
+ - running
+ - 4
+ - Number indicating the security revision of the Option ROM.
+ * - ``fw.bundle_id``
+ - running
+ - 0x80000d0d
+ - Unique identifier of the firmware image file that was loaded onto
+ the device. Also referred to as the EETRACK identifier of the NVM.
+ * - ``fw.mgmt.api``
+ - running
+ - 1.5.1
+ - 3-digit version number (major.minor.patch) of the API exported over
+ the AdminQ by the management firmware. Used by the driver to
+ identify what commands are supported. Historical versions of the
+ kernel only displayed a 2-digit version number (major.minor).
+ * - ``fw.mgmt.build``
+ - running
+ - 0x305d955f
+ - Unique identifier of the source for the management firmware.
+ * - ``fw.mgmt.srev``
+ - running
+ - 3
+ - Number indicating the security revision of the firmware.
+ * - ``fw.psid.api``
+ - running
+ - 0.80
+ - Version defining the format of the flash contents.
+ * - ``fw.netlist``
+ - running
+ - 1.1.2000-6.7.0
+ - The version of the netlist module. This module defines the device's
+ Ethernet capabilities and default settings, and is used by the
+ management firmware as part of managing link and device
+ connectivity.
+ * - ``fw.netlist.build``
+ - running
+ - 0xee16ced7
+ - The first 4 bytes of the hash of the netlist module contents.
+
+Flash Update
+============
+
+The ``ixgbe`` driver implements support for flash update using the
+``devlink-flash`` interface. It supports updating the device flash using a
+combined flash image that contains the ``fw.mgmt``, ``fw.undi``, and
+``fw.netlist`` components.
+
+.. list-table:: List of supported overwrite modes
+ :widths: 5 95
+
+ * - Bits
+ - Behavior
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS``
+ - Do not preserve settings stored in the flash components being
+ updated. This includes overwriting the port configuration that
+ determines the number of physical functions the device will
+ initialize with.
+ * - ``DEVLINK_FLASH_OVERWRITE_SETTINGS`` and ``DEVLINK_FLASH_OVERWRITE_IDENTIFIERS``
+ - Do not preserve either settings or identifiers. Overwrite everything
+ in the flash with the contents from the provided image, without
+ performing any preservation. This includes overwriting device
+ identifying fields such as the MAC address, Vital product Data (VPD) area,
+ and device serial number. It is expected that this combination be used with an
+ image customized for the specific device.
+
+Reload
+======
+
+The ``ixgbe`` driver supports activating new firmware after a flash update
+using ``DEVLINK_CMD_RELOAD`` with the ``DEVLINK_RELOAD_ACTION_FW_ACTIVATE``
+action.
+
+.. code:: shell
+
+ $ devlink dev reload pci/0000:01:00.0 reload action fw_activate
+
+The new firmware is activated by issuing a device specific Embedded
+Management Processor reset which requests the device to reset and reload the
+EMP firmware image.
+
+The driver does not currently support reloading the driver via
+``DEVLINK_RELOAD_ACTION_DRIVER_REINIT``.
+
+Regions
+=======
+
+The ``ixgbe`` driver implements the following regions for accessing internal
+device data.
+
+.. list-table:: regions implemented
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``nvm-flash``
+ - The contents of the entire flash chip, sometimes referred to as
+ the device's Non Volatile Memory.
+ * - ``shadow-ram``
+ - The contents of the Shadow RAM, which is loaded from the beginning
+ of the flash. Although the contents are primarily from the flash,
+ this area also contains data generated during device boot which is
+ not stored in flash.
+ * - ``device-caps``
+ - The contents of the device firmware's capabilities buffer. Useful to
+ determine the current state and configuration of the device.
+
+Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
+snapshot. The ``device-caps`` region requires a snapshot as the contents are
+sent by firmware and can't be split into separate reads.
+
+Users can request an immediate capture of a snapshot for all three regions
+via the ``DEVLINK_CMD_REGION_NEW`` command.
+
+.. code:: shell
+
+ $ devlink region show
+ pci/0000:01:00.0/nvm-flash: size 10485760 snapshot [] max 1
+ pci/0000:01:00.0/device-caps: size 4096 snapshot [] max 10
+
+ $ devlink region new pci/0000:01:00.0/nvm-flash snapshot 1
+
+ $ devlink region dump pci/0000:01:00.0/nvm-flash snapshot 1
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+ 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc
+ 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
+
+ $ devlink region read pci/0000:01:00.0/nvm-flash snapshot 1 address 0 length 16
+ 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+
+ $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
diff --git a/Documentation/networking/devlink/kvaser_pciefd.rst b/Documentation/networking/devlink/kvaser_pciefd.rst
new file mode 100644
index 000000000000..075edd2a508a
--- /dev/null
+++ b/Documentation/networking/devlink/kvaser_pciefd.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+kvaser_pciefd devlink support
+=============================
+
+This document describes the devlink features implemented by the
+``kvaser_pciefd`` device driver.
+
+Info versions
+=============
+
+The ``kvaser_pciefd`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of the firmware running on the device. Also available
+ through ``ethtool -i`` as ``firmware-version``.
diff --git a/Documentation/networking/devlink/kvaser_usb.rst b/Documentation/networking/devlink/kvaser_usb.rst
new file mode 100644
index 000000000000..403db3766cb4
--- /dev/null
+++ b/Documentation/networking/devlink/kvaser_usb.rst
@@ -0,0 +1,33 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+kvaser_usb devlink support
+==========================
+
+This document describes the devlink features implemented by the
+``kvaser_usb`` device driver.
+
+Info versions
+=============
+
+The ``kvaser_usb`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of the firmware running on the device. Also available
+ through ``ethtool -i`` as ``firmware-version``.
+ * - ``board.rev``
+ - fixed
+ - The device hardware revision.
+ * - ``board.id``
+ - fixed
+ - The device EAN (product number).
+ * - ``serial_number``
+ - fixed
+ - The device serial number.
diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst
index 88482725422c..3932004eae82 100644
--- a/Documentation/networking/devlink/netdevsim.rst
+++ b/Documentation/networking/devlink/netdevsim.rst
@@ -62,7 +62,7 @@ Rate objects
The ``netdevsim`` driver supports rate objects management, which includes:
-- registerging/unregistering leaf rate objects per VF devlink port;
+- registering/unregistering leaf rate objects per VF devlink port;
- creation/deletion node rate objects;
- setting tx_share and tx_max rate values for any rate object type;
- setting parent node for any rate object type.
diff --git a/Documentation/networking/devlink/zl3073x.rst b/Documentation/networking/devlink/zl3073x.rst
new file mode 100644
index 000000000000..4b6cfaf38643
--- /dev/null
+++ b/Documentation/networking/devlink/zl3073x.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+zl3073x devlink support
+=======================
+
+This document describes the devlink features implemented by the ``zl3073x``
+device driver.
+
+Parameters
+==========
+
+.. list-table:: Generic parameters implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Mode
+ - Notes
+ * - ``clock_id``
+ - driverinit
+ - Set the clock ID that is used by the driver for registering DPLL devices
+ and pins.
+
+Info versions
+=============
+
+The ``zl3073x`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 5 90
+
+ * - Name
+ - Type
+ - Example
+ - Description
+ * - ``asic.id``
+ - fixed
+ - 1E94
+ - Chip identification number
+ * - ``asic.rev``
+ - fixed
+ - 300
+ - Chip revision number
+ * - ``fw``
+ - running
+ - 7006
+ - Firmware version number
+ * - ``custom_cfg``
+ - running
+ - 1.3.0.1
+ - Device configuration version customized by OEM
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
index eb678ca45496..a6cd7236bfbd 100644
--- a/Documentation/networking/devmem.rst
+++ b/Documentation/networking/devmem.rst
@@ -62,15 +62,15 @@ More Info
https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
-Interface
-=========
+RX Interface
+============
Example
-------
-tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
-the RX path of this API.
+./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
+setting up the RX path of this API.
NIC Setup
@@ -235,6 +235,148 @@ can be less than the tokens provided by the user in case of:
(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.
+TX Interface
+============
+
+
+Example
+-------
+
+./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
+setting up the TX path of this API.
+
+
+NIC Setup
+---------
+
+The user must bind a TX dmabuf to a given NIC using the netlink API::
+
+ struct netdev_bind_tx_req *req = NULL;
+ struct netdev_bind_tx_rsp *rsp = NULL;
+ struct ynl_error yerr;
+
+ *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+ req = netdev_bind_tx_req_alloc();
+ netdev_bind_tx_req_set_ifindex(req, ifindex);
+ netdev_bind_tx_req_set_fd(req, dmabuf_fd);
+
+ rsp = netdev_bind_tx(*ys, req);
+
+ tx_dmabuf_id = rsp->id;
+
+
+The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
+that has been bound.
+
+The user can unbind the dmabuf from the netdevice by closing the netlink socket
+that established the binding. We do this so that the binding is automatically
+unbound even if the userspace process crashes.
+
+Note that any reasonably well-behaved dmabuf from any exporter should work with
+devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
+this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
+
+Socket Setup
+------------
+
+The user application must use MSG_ZEROCOPY flag when sending devmem TCP. Devmem
+cannot be copied by the kernel, so the semantics of the devmem TX are similar
+to the semantics of MSG_ZEROCOPY::
+
+ setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));
+
+It is also recommended that the user binds the TX socket to the same interface
+the dma-buf has been bound to via SO_BINDTODEVICE::
+
+ setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);
+
+
+Sending data
+------------
+
+Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.
+
+The user should create a msghdr where,
+
+* iov_base is set to the offset into the dmabuf to start sending from
+* iov_len is set to the number of bytes to be sent from the dmabuf
+
+The user passes the dma-buf id to send from via the dmabuf_tx_cmsg.dmabuf_id.
+
+The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
+from offset 2000 into the dmabuf. The dmabuf to send from is tx_dmabuf_id::
+
+ char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
+ struct dmabuf_tx_cmsg ddmabuf;
+ struct msghdr msg = {};
+ struct cmsghdr *cmsg;
+ struct iovec iov[2];
+
+ iov[0].iov_base = (void*)100;
+ iov[0].iov_len = 1024;
+ iov[1].iov_base = (void*)2000;
+ iov[1].iov_len = 2048;
+
+ msg.msg_iov = iov;
+ msg.msg_iovlen = 2;
+
+ msg.msg_control = ctrl_data;
+ msg.msg_controllen = sizeof(ctrl_data);
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
+ cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));
+
+ ddmabuf.dmabuf_id = tx_dmabuf_id;
+
+ *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;
+
+ sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
+
+
+Reusing TX dmabufs
+------------------
+
+Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
+contents of the dma-buf while a send operation is in progress. This is because
+the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
+will pin and send data from the buffer available to the userspace.
+
+Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
+using MSG_ERRQUEUE::
+
+ int64_t tstop = gettimeofday_ms() + waittime_ms;
+ char control[CMSG_SPACE(100)] = {};
+ struct sock_extended_err *serr;
+ struct msghdr msg = {};
+ struct cmsghdr *cm;
+ int retries = 10;
+ __u32 hi, lo;
+
+ msg.msg_control = control;
+ msg.msg_controllen = sizeof(control);
+
+ while (gettimeofday_ms() < tstop) {
+ if (!do_poll(fd)) continue;
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+
+ for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+ serr = (void *)CMSG_DATA(cm);
+
+ hi = serr->ee_data;
+ lo = serr->ee_info;
+
+ fprintf(stdout, "tx complete [%d,%d]\n", lo, hi);
+ }
+ }
+
+After the associated sendmsg has been completed, the dmabuf can be reused by
+the userspace.
+
+
Implementation & Caveats
========================
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index b6e9af4d0f1b..ab20c644af24 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -239,6 +239,9 @@ Userspace to kernel:
``ETHTOOL_MSG_PHY_GET`` get Ethernet PHY information
``ETHTOOL_MSG_TSCONFIG_GET`` get hw timestamping configuration
``ETHTOOL_MSG_TSCONFIG_SET`` set hw timestamping configuration
+ ``ETHTOOL_MSG_RSS_SET`` set RSS settings
+ ``ETHTOOL_MSG_RSS_CREATE_ACT`` create an additional RSS context
+ ``ETHTOOL_MSG_RSS_DELETE_ACT`` delete an additional RSS context
===================================== =================================
Kernel to userspace:
@@ -281,6 +284,7 @@ Kernel to userspace:
``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters
``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters
``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings
+ ``ETHTOOL_MSG_RSS_NTF`` RSS settings
``ETHTOOL_MSG_PLCA_GET_CFG_REPLY`` PLCA RS parameters
``ETHTOOL_MSG_PLCA_GET_STATUS_REPLY`` PLCA RS status
``ETHTOOL_MSG_PLCA_NTF`` PLCA RS parameters
@@ -290,6 +294,11 @@ Kernel to userspace:
``ETHTOOL_MSG_PHY_NTF`` Ethernet PHY information change
``ETHTOOL_MSG_TSCONFIG_GET_REPLY`` hw timestamping configuration
``ETHTOOL_MSG_TSCONFIG_SET_REPLY`` new hw timestamping configuration
+ ``ETHTOOL_MSG_PSE_NTF`` PSE events notification
+ ``ETHTOOL_MSG_RSS_NTF`` RSS settings notification
+ ``ETHTOOL_MSG_RSS_CREATE_ACT_REPLY`` create an additional RSS context
+ ``ETHTOOL_MSG_RSS_CREATE_NTF`` additional RSS context created
+ ``ETHTOOL_MSG_RSS_DELETE_NTF`` additional RSS context deleted
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@@ -1788,6 +1797,11 @@ Kernel response contents:
limit of the PoE PSE.
``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested Supported power limit
configuration ranges.
+ ``ETHTOOL_A_PSE_PW_D_ID`` u32 Index of the PSE power domain
+ ``ETHTOOL_A_PSE_PRIO_MAX`` u32 Priority maximum configurable
+ on the PoE PSE
+ ``ETHTOOL_A_PSE_PRIO`` u32 Priority of the PoE PSE
+ currently configured
========================================== ====== =============================
When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies
@@ -1861,6 +1875,15 @@ identifies the C33 PSE power limit ranges through
If the controller works with fixed classes, the min and max values will be
equal.
+The ``ETHTOOL_A_PSE_PW_D_ID`` attribute identifies the index of PSE power
+domain.
+
+When set, the optional ``ETHTOOL_A_PSE_PRIO_MAX`` attribute identifies
+the PSE maximum priority value.
+When set, the optional ``ETHTOOL_A_PSE_PRIO`` attributes is used to
+identifies the currently configured PSE priority.
+For a description of PSE priority attributes, see ``PSE_SET``.
+
PSE_SET
=======
@@ -1874,6 +1897,8 @@ Request contents:
``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` u32 Control PSE Admin state
``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` u32 Control PoE PSE available
power limit
+ ``ETHTOOL_A_PSE_PRIO`` u32 Control priority of the
+ PoE PSE
====================================== ====== =============================
When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used
@@ -1896,6 +1921,38 @@ various existing products that document power consumption in watts rather than
classes. If power limit configuration based on classes is needed, the
conversion can be done in user space, for example by ethtool.
+When set, the optional ``ETHTOOL_A_PSE_PRIO`` attributes is used to
+control the PSE priority. Allowed priority value are between zero and
+the value of ``ETHTOOL_A_PSE_PRIO_MAX`` attribute.
+
+A lower value indicates a higher priority, meaning that a priority value
+of 0 corresponds to the highest port priority.
+Port priority serves two functions:
+
+ - Power-up Order: After a reset, ports are powered up in order of their
+ priority from highest to lowest. Ports with higher priority
+ (lower values) power up first.
+ - Shutdown Order: When the power budget is exceeded, ports with lower
+ priority (higher values) are turned off first.
+
+PSE_NTF
+=======
+
+Notify PSE events.
+
+Notification contents:
+
+ =============================== ====== ========================
+ ``ETHTOOL_A_PSE_HEADER`` nested request header
+ ``ETHTOOL_A_PSE_EVENTS`` bitset PSE events
+ =============================== ====== ========================
+
+When set, the optional ``ETHTOOL_A_PSE_EVENTS`` attribute identifies the
+PSE events.
+
+.. kernel-doc:: include/uapi/linux/ethtool_netlink_generated.h
+ :identifiers: ethtool_pse_event
+
RSS_GET
=======
@@ -1919,14 +1976,15 @@ used to ignore context 0s and only dump additional contexts).
Kernel response contents:
-===================================== ====== ==========================
+===================================== ====== ===============================
``ETHTOOL_A_RSS_HEADER`` nested reply header
``ETHTOOL_A_RSS_CONTEXT`` u32 context number
``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
``ETHTOOL_A_RSS_INPUT_XFRM`` u32 RSS input data transformation
-===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_FLOW_HASH`` nested Header fields included in hash
+===================================== ====== ===============================
ETHTOOL_A_RSS_HFUNC attribute is bitmap indicating the hash function
being used. Current supported options are toeplitz, xor or crc32.
@@ -1935,6 +1993,67 @@ indicates queue number.
ETHTOOL_A_RSS_INPUT_XFRM attribute is a bitmap indicating the type of
transformation applied to the input protocol fields before given to the RSS
hfunc. Current supported options are symmetric-xor and symmetric-or-xor.
+ETHTOOL_A_RSS_FLOW_HASH carries per-flow type bitmask of which header
+fields are included in the hash calculation.
+
+RSS_SET
+=======
+
+Request contents:
+
+===================================== ====== ==============================
+ ``ETHTOOL_A_RSS_HEADER`` nested request header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
+ ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
+ ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
+ ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
+ ``ETHTOOL_A_RSS_INPUT_XFRM`` u32 RSS input data transformation
+ ``ETHTOOL_A_RSS_FLOW_HASH`` nested Header fields included in hash
+===================================== ====== ==============================
+
+``ETHTOOL_A_RSS_INDIR`` is the minimal RSS table the user expects. Kernel and
+the device driver may replicate the table if its smaller than smallest table
+size supported by the device. For example if user requests ``[0, 1]`` but the
+device needs at least 8 entries - the real table in use will end up being
+``[0, 1, 0, 1, 0, 1, 0, 1]``. Most devices require the table size to be power
+of 2, so tables which size is not a power of 2 will likely be rejected.
+Using table of size 0 will reset the indirection table to the default.
+
+RSS_CREATE_ACT
+==============
+
+Request contents:
+
+===================================== ====== ==============================
+ ``ETHTOOL_A_RSS_HEADER`` nested request header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
+ ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
+ ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
+ ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
+ ``ETHTOOL_A_RSS_INPUT_XFRM`` u32 RSS input data transformation
+===================================== ====== ==============================
+
+Kernel response contents:
+
+===================================== ====== ==============================
+ ``ETHTOOL_A_RSS_HEADER`` nested request header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
+===================================== ====== ==============================
+
+Create an additional RSS context, if ``ETHTOOL_A_RSS_CONTEXT`` is not
+specified kernel will allocate one automatically.
+
+RSS_DELETE_ACT
+==============
+
+Request contents:
+
+===================================== ====== ==============================
+ ``ETHTOOL_A_RSS_HEADER`` nested request header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
+===================================== ====== ==============================
+
+Delete an additional RSS context.
PLCA_GET_CFG
============
@@ -2386,8 +2505,8 @@ are netlink only.
``ETHTOOL_SFLAGS`` ``ETHTOOL_MSG_FEATURES_SET``
``ETHTOOL_GPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_GET``
``ETHTOOL_SPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_SET``
- ``ETHTOOL_GRXFH`` n/a
- ``ETHTOOL_SRXFH`` n/a
+ ``ETHTOOL_GRXFH`` ``ETHTOOL_MSG_RSS_GET``
+ ``ETHTOOL_SRXFH`` ``ETHTOOL_MSG_RSS_SET``
``ETHTOOL_GGRO`` ``ETHTOOL_MSG_FEATURES_GET``
``ETHTOOL_SGRO`` ``ETHTOOL_MSG_FEATURES_SET``
``ETHTOOL_GRXRINGS`` n/a
@@ -2401,8 +2520,8 @@ are netlink only.
``ETHTOOL_SRXNTUPLE`` n/a
``ETHTOOL_GRXNTUPLE`` n/a
``ETHTOOL_GSSET_INFO`` ``ETHTOOL_MSG_STRSET_GET``
- ``ETHTOOL_GRXFHINDIR`` n/a
- ``ETHTOOL_SRXFHINDIR`` n/a
+ ``ETHTOOL_GRXFHINDIR`` ``ETHTOOL_MSG_RSS_GET``
+ ``ETHTOOL_SRXFHINDIR`` ``ETHTOOL_MSG_RSS_SET``
``ETHTOOL_GFEATURES`` ``ETHTOOL_MSG_FEATURES_GET``
``ETHTOOL_SFEATURES`` ``ETHTOOL_MSG_FEATURES_SET``
``ETHTOOL_GCHANNELS`` ``ETHTOOL_MSG_CHANNELS_GET``
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index c64133d309bf..ac90b82f3ce9 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -48,7 +48,6 @@ Contents:
ax25
bonding
cdc_mbim
- dccp
dctcp
devmem
dns_resolver
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 5c63ab928b97..bb620f554598 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -8,15 +8,19 @@ IP Sysctl
==============================
ip_forward - BOOLEAN
- - 0 - disabled (default)
- - not 0 - enabled
-
Forward Packets between interfaces.
This variable is special, its change resets all configuration
parameters to their default state (RFC1122 for hosts, RFC1812
for routers)
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
ip_default_ttl - INTEGER
Default value of TTL field (Time To Live) for outgoing (but not
forwarded) IP packets. Should be between 1 and 255 inclusive.
@@ -37,8 +41,8 @@ ip_no_pmtu_disc - INTEGER
Mode 3 is a hardened pmtu discover mode. The kernel will only
accept fragmentation-needed errors if the underlying protocol
can verify them besides a plain socket lookup. Current
- protocols for which pmtu events will be honored are TCP, SCTP
- and DCCP as they verify e.g. the sequence number or the
+ protocols for which pmtu events will be honored are TCP and
+ SCTP as they verify e.g. the sequence number or the
association. This mode should not be enabled globally but is
only intended to secure e.g. name servers in namespaces where
TCP path mtu must still work but path MTU information of other
@@ -62,20 +66,25 @@ ip_forward_use_pmtu - BOOLEAN
kernel honoring this information. This is normally not the
case.
- Default: 0 (disabled)
-
Possible values:
- - 0 - disabled
- - 1 - enabled
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
fwmark_reflect - BOOLEAN
Controls the fwmark of kernel-generated IPv4 reply packets that are not
associated with a socket for example, TCP RSTs or ICMP echo replies).
- If unset, these packets have a fwmark of zero. If set, they have the
+ If disabled, these packets have a fwmark of zero. If enabled, they have the
fwmark of the packet they are replying to.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
fib_multipath_use_neigh - BOOLEAN
Use status of existing neighbor entry when determining nexthop for
@@ -83,12 +92,12 @@ fib_multipath_use_neigh - BOOLEAN
packets could be directed to a failed nexthop. Only valid for kernels
built with CONFIG_IP_ROUTE_MULTIPATH enabled.
- Default: 0 (disabled)
-
Possible values:
- - 0 - disabled
- - 1 - enabled
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
fib_multipath_hash_policy - INTEGER
Controls which hash policy to use for multipath routes. Only valid
@@ -368,7 +377,12 @@ tcp_autocorking - BOOLEAN
queue. Applications can still use TCP_CORK for optimal behavior
when they know how/when to uncork their sockets.
- Default : 1
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
tcp_available_congestion_control - STRING
Shows the available congestion control choices that are registered.
@@ -408,9 +422,16 @@ tcp_congestion_control - STRING
tcp_dsack - BOOLEAN
Allows TCP to send "duplicate" SACKs.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
+
tcp_early_retrans - INTEGER
Tail loss probe (TLP) converts RTOs occurring due to tail
- losses into fast recovery (draft-ietf-tcpm-rack). Note that
+ losses into fast recovery (RFC8985). Note that
TLP requires RACK to function properly (see tcp_recovery below)
Possible values:
@@ -447,7 +468,12 @@ tcp_ecn_fallback - BOOLEAN
knob. The value is not used, if tcp_ecn or per route (or congestion
control) ECN settings are disabled.
- Default: 1 (fallback enabled)
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
tcp_fack - BOOLEAN
This is a legacy option, it has no effect anymore.
@@ -474,7 +500,7 @@ tcp_frto - INTEGER
By default it's enabled with a non-zero value. 0 disables F-RTO.
tcp_fwmark_accept - BOOLEAN
- If set, incoming connections to listening sockets that do not have a
+ If enabled, incoming connections to listening sockets that do not have a
socket mark will set the mark of the accepting socket to the fwmark of
the incoming SYN packet. This will cause all packets on that connection
(starting from the first SYNACK) to be sent with that fwmark. The
@@ -482,7 +508,12 @@ tcp_fwmark_accept - BOOLEAN
have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are
unaffected.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
tcp_invalid_ratelimit - INTEGER
Limit the maximal rate for sending duplicate acknowledgments
@@ -528,6 +559,11 @@ tcp_l3mdev_accept - BOOLEAN
which the packets originated. Only valid when the kernel was
compiled with CONFIG_NET_L3_MASTER_DEV.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
Default: 0 (disabled)
tcp_low_latency - BOOLEAN
@@ -593,10 +629,16 @@ tcp_min_rtt_wlen - INTEGER
Default: 300
tcp_moderate_rcvbuf - BOOLEAN
- If set, TCP performs receive buffer auto-tuning, attempting to
+ If enabled, TCP performs receive buffer auto-tuning, attempting to
automatically size the buffer (no greater than tcp_rmem[2]) to
- match the size required by the path for full throughput. Enabled by
- default.
+ match the size required by the path for full throughput.
+
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
tcp_mtu_probing - INTEGER
Controls TCP Packetization-Layer Path MTU Discovery. Takes three
@@ -621,13 +663,26 @@ tcp_no_metrics_save - BOOLEAN
when the connection closes, so that connections established in the
near future can use these to set initial conditions. Usually, this
increases overall performance, but may sometimes cause performance
- degradation. If set, TCP will not cache metrics on closing
+ degradation. If enabled, TCP will not cache metrics on closing
connections.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
tcp_no_ssthresh_metrics_save - BOOLEAN
Controls whether TCP saves ssthresh metrics in the route cache.
+ If enabled, ssthresh metrics are disabled.
+
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
- Default is 1, which disables ssthresh metrics.
+ Default: 1 (enabled)
tcp_orphan_retries - INTEGER
This value influences the timeout of a locally closed TCP connection,
@@ -645,9 +700,11 @@ tcp_recovery - INTEGER
features.
========= =============================================================
- RACK: 0x1 enables the RACK loss detection for fast detection of lost
- retransmissions and tail drops. It also subsumes and disables
- RFC6675 recovery for SACK connections.
+ RACK: 0x1 enables RACK loss detection, for fast detection of lost
+ retransmissions and tail drops, and resilience to
+ reordering. currently, setting this bit to 0 has no
+ effect, since RACK is the only supported loss detection
+ algorithm.
RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
@@ -664,6 +721,11 @@ tcp_reflect_tos - BOOLEAN
This options affects both IPv4 and IPv6.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
Default: 0 (disabled)
tcp_reordering - INTEGER
@@ -685,6 +747,13 @@ tcp_retrans_collapse - BOOLEAN
On retransmit try to send bigger packets to work around bugs in
certain TCP stacks.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
+
tcp_retries1 - INTEGER
This value influences the time, after which TCP decides, that
something is wrong due to unacknowledged RTO retransmissions,
@@ -712,11 +781,16 @@ tcp_retries2 - INTEGER
which corresponds to a value of at least 8.
tcp_rfc1337 - BOOLEAN
- If set, the TCP stack behaves conforming to RFC1337. If unset,
+ If enabled, the TCP stack behaves conforming to RFC1337. If unset,
we are not conforming to RFC, but prevent TCP TIME_WAIT
assassination.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
tcp_rmem - vector of 3 INTEGERs: min, default, max
min: Minimal size of receive buffer used by TCP sockets.
@@ -735,11 +809,18 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
automatic tuning of that socket's receive buffer size, in which
case this value is ignored.
- Default: between 131072 and 6MB, depending on RAM size.
+ Default: between 131072 and 32MB, depending on RAM size.
tcp_sack - BOOLEAN
Enable select acknowledgments (SACKS).
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
+
tcp_comp_sack_delay_ns - LONG INTEGER
TCP tries to reduce number of SACK sent, using a timer
based on 5% of SRTT, capped by this sysctl, in nano seconds.
@@ -762,26 +843,41 @@ tcp_comp_sack_nr - INTEGER
Default : 44
tcp_backlog_ack_defer - BOOLEAN
- If set, user thread processing socket backlog tries sending
+ If enabled, user thread processing socket backlog tries sending
one ACK for the whole queue. This helps to avoid potential
long latencies at end of a TCP socket syscall.
- Default : true
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
tcp_slow_start_after_idle - BOOLEAN
- If set, provide RFC2861 behavior and time out the congestion
+ If enabled, provide RFC2861 behavior and time out the congestion
window after an idle period. An idle period is defined at
the current RTO. If unset, the congestion window will not
be timed out after an idle period.
- Default: 1
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
tcp_stdurg - BOOLEAN
Use the Host requirements interpretation of the TCP urgent pointer field.
- Most hosts use the older BSD interpretation, so if you turn this on
+ Most hosts use the older BSD interpretation, so if enabled,
Linux might not communicate correctly with them.
- Default: FALSE
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
tcp_synack_retries - INTEGER
Number of times SYNACKs for a passive TCP connection attempt will
@@ -838,7 +934,12 @@ tcp_migrate_req - BOOLEAN
migration by returning SK_DROP in the type of eBPF program, or
disable this option.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
tcp_fastopen - INTEGER
Enable TCP Fast Open (RFC7413) to send and accept data in the opening
@@ -1019,6 +1120,13 @@ tcp_tw_reuse_delay - UNSIGNED INTEGER
tcp_window_scaling - BOOLEAN
Enable window scaling as defined in RFC1323.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
+
tcp_shrink_window - BOOLEAN
This changes how the TCP receive window is calculated.
@@ -1026,13 +1134,15 @@ tcp_shrink_window - BOOLEAN
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122.
- - 0 - Disabled. The window is never shrunk.
- - 1 - Enabled. The window is shrunk when necessary to remain within
- the memory limit set by autotuning (sk_rcvbuf).
- This only occurs if a non-zero receive window
- scaling factor is also in effect.
+ Possible values:
+
+ - 0 (disabled) - The window is never shrunk.
+ - 1 (enabled) - The window is shrunk when necessary to remain within
+ the memory limit set by autotuning (sk_rcvbuf).
+ This only occurs if a non-zero receive window
+ scaling factor is also in effect.
- Default: 0
+ Default: 0 (disabled)
tcp_wmem - vector of 3 INTEGERs: min, default, max
min: Amount of memory reserved for send buffers for TCP sockets.
@@ -1069,16 +1179,21 @@ tcp_notsent_lowat - UNSIGNED INTEGER
Default: UINT_MAX (0xFFFFFFFF)
tcp_workaround_signed_windows - BOOLEAN
- If set, assume no receipt of a window scaling option means the
+ If enabled, assume no receipt of a window scaling option means the
remote TCP is broken and treats the window as a signed quantity.
- If unset, assume the remote TCP is not broken even if we do
+ If disabled, assume the remote TCP is not broken even if we do
not receive a window scaling option from them.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
tcp_thin_linear_timeouts - BOOLEAN
Enable dynamic triggering of linear timeouts for thin streams.
- If set, a check is performed upon retransmission by timeout to
+ If enabled, a check is performed upon retransmission by timeout to
determine if the stream is thin (less than 4 packets in flight).
As long as the stream is found to be thin, up to 6 linear
timeouts may be performed before exponential backoff mode is
@@ -1087,7 +1202,12 @@ tcp_thin_linear_timeouts - BOOLEAN
For more information on thin streams, see
Documentation/networking/tcp-thin.rst
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
tcp_limit_output_bytes - INTEGER
Controls TCP Small Queue limit per tcp socket.
@@ -1099,7 +1219,7 @@ tcp_limit_output_bytes - INTEGER
limits the number of bytes on qdisc or device to reduce artificial
RTT/cwnd and reduce bufferbloat.
- Default: 1048576 (16 * 65536)
+ Default: 4194304 (4 MB)
tcp_challenge_ack_limit - INTEGER
Limits number of Challenge ACK sent per second, as recommended
@@ -1139,7 +1259,7 @@ tcp_child_ehash_entries - INTEGER
Default: 0
tcp_plb_enabled - BOOLEAN
- If set and the underlying congestion control (e.g. DCTCP) supports
+ If enabled and the underlying congestion control (e.g. DCTCP) supports
and enables PLB feature, TCP PLB (Protective Load Balancing) is
enabled. PLB is described in the following paper:
https://doi.org/10.1145/3544216.3544226. Based on PLB parameters,
@@ -1155,12 +1275,17 @@ tcp_plb_enabled - BOOLEAN
by switches to determine next hop. In either case, further host
and switch side changes will be needed.
- When set, PLB assumes that congestion signal (e.g. ECN) is made
+ If enabled, PLB assumes that congestion signal (e.g. ECN) is made
available and used by congestion control module to estimate a
congestion measure (e.g. ce_ratio). PLB needs a congestion measure to
make repathing decisions.
- Default: FALSE
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
tcp_plb_idle_rehash_rounds - INTEGER
Number of consecutive congested rounds (RTT) seen after which
@@ -1260,6 +1385,11 @@ udp_l3mdev_accept - BOOLEAN
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
Default: 0 (disabled)
udp_mem - vector of 3 INTEGERs: min, pressure, max
@@ -1320,19 +1450,29 @@ raw_l3mdev_accept - BOOLEAN
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
Default: 1 (enabled)
CIPSOv4 Variables
=================
cipso_cache_enable - BOOLEAN
- If set, enable additions to and lookups from the CIPSO label mapping
- cache. If unset, additions are ignored and lookups always result in a
+ If enabled, enable additions to and lookups from the CIPSO label mapping
+ cache. If disabled, additions are ignored and lookups always result in a
miss. However, regardless of the setting the cache is still
invalidated when required when means you can safely toggle this on and
off and the cache will always be "safe".
- Default: 1
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
cipso_cache_bucket_size - INTEGER
The CIPSO label cache consists of a fixed size hash table with each
@@ -1350,17 +1490,27 @@ cipso_rbm_optfmt - BOOLEAN
This means that when set the CIPSO tag will be padded with empty
categories in order to make the packet data 32-bit aligned.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
-cipso_rbm_structvalid - BOOLEAN
- If set, do a very strict check of the CIPSO option when
- ip_options_compile() is called. If unset, relax the checks done during
+ Default: 0 (disabled)
+
+cipso_rbm_strictvalid - BOOLEAN
+ If enabled, do a very strict check of the CIPSO option when
+ ip_options_compile() is called. If disabled, relax the checks done during
ip_options_compile(). Either way is "safe" as errors are caught else
where in the CIPSO processing code but setting this to 0 (False) should
result in less work (i.e. it should be faster) but could cause problems
with other implementations that require strict checking.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
IP Variables
============
@@ -1417,10 +1567,15 @@ ip_unprivileged_port_start - INTEGER
Default: 1024
ip_nonlocal_bind - BOOLEAN
- If set, allows processes to bind() to non-local IP addresses,
+ If enabled, allows processes to bind() to non-local IP addresses,
which can be quite useful - but may break some applications.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
ip_autobind_reuse - BOOLEAN
By default, bind() does not select the ports automatically even if
@@ -1429,7 +1584,13 @@ ip_autobind_reuse - BOOLEAN
when you use bind()+connect(), but may break some applications.
The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this
option should only be set by experts.
- Default: 0
+
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
ip_dynaddr - INTEGER
If set non-zero, enables support for dynamic addresses.
@@ -1447,7 +1608,12 @@ ip_early_demux - BOOLEAN
It may add an additional cost for pure routing workloads that
reduces overall throughput, in such case you should disable it.
- Default: 1
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
ping_group_range - 2 INTEGERS
Restrict ICMP_PROTO datagram sockets to users in the group range.
@@ -1459,31 +1625,56 @@ ping_group_range - 2 INTEGERS
tcp_early_demux - BOOLEAN
Enable early demux for established TCP sockets.
- Default: 1
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
udp_early_demux - BOOLEAN
Enable early demux for connected UDP sockets. Disable this if
your system could experience more unconnected load.
- Default: 1
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
icmp_echo_ignore_all - BOOLEAN
- If set non-zero, then the kernel will ignore all ICMP ECHO
+ If enabled, then the kernel will ignore all ICMP ECHO
requests sent to it.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
icmp_echo_enable_probe - BOOLEAN
- If set to one, then the kernel will respond to RFC 8335 PROBE
+ If enabled, then the kernel will respond to RFC 8335 PROBE
requests sent to it.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
icmp_echo_ignore_broadcasts - BOOLEAN
- If set non-zero, then the kernel will ignore all ICMP ECHO and
+ If enabled, then the kernel will ignore all ICMP ECHO and
TIMESTAMP requests sent to it via broadcast/multicast.
- Default: 1
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
icmp_ratelimit - INTEGER
Limit the maximal rates for sending ICMP packets whose type matches
@@ -1540,17 +1731,22 @@ icmp_ratemask - INTEGER
icmp_ignore_bogus_error_responses - BOOLEAN
Some routers violate RFC1122 by sending bogus responses to broadcast
frames. Such violations are normally logged via a kernel warning.
- If this is set to TRUE, the kernel will not give such warnings, which
+ If enabled, the kernel will not give such warnings, which
will avoid log file clutter.
- Default: 1
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
icmp_errors_use_inbound_ifaddr - BOOLEAN
- If zero, icmp error messages are sent with the primary address of
+ If disabled, icmp error messages are sent with the primary address of
the exiting interface.
- If non-zero, the message will be sent with the primary address of
+ If enabled, the message will be sent with the primary address of
the interface that received the packet that caused the icmp error.
This is the behaviour many network administrators will expect from
a router. And it can make debugging complicated network layouts
@@ -1560,7 +1756,12 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
then the primary address of the first non-loopback interface that
has one will be used regardless of this setting.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
igmp_max_memberships - INTEGER
Change the maximum number of multicast groups we can subscribe to.
@@ -1690,10 +1891,10 @@ proxy_arp_pvlan - BOOLEAN
This technology is known by different names:
- In RFC 3069 it is called VLAN Aggregation.
- Cisco and Allied Telesyn call it Private VLAN.
- Hewlett-Packard call it Source-Port filtering or port-isolation.
- Ericsson call it MAC-Forced Forwarding (RFC Draft).
+ - In RFC 3069 it is called VLAN Aggregation.
+ - Cisco and Allied Telesyn call it Private VLAN.
+ - Hewlett-Packard call it Source-Port filtering or port-isolation.
+ - Ericsson call it MAC-Forced Forwarding (RFC Draft).
proxy_delay - INTEGER
Delay proxy response.
@@ -1910,8 +2111,12 @@ arp_evict_nocarrier - BOOLEAN
between access points on the same network. In most cases this should
remain as the default (1).
- - 1 - (default): Clear the ARP cache on NOCARRIER events
- - 0 - Do not clear ARP cache on NOCARRIER events
+ Possible values:
+
+ - 0 (disabled) - Do not clear ARP cache on NOCARRIER events
+ - 1 (enabled) - Clear the ARP cache on NOCARRIER events
+
+ Default: 1 (enabled)
mcast_solicit - INTEGER
The maximum number of multicast probes in INCOMPLETE state,
@@ -1934,9 +2139,23 @@ mcast_resolicit - INTEGER
disable_policy - BOOLEAN
Disable IPSEC policy (SPD) for this interface
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
disable_xfrm - BOOLEAN
Disable IPSEC encryption on this interface, whatever the policy
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
igmpv2_unsolicited_report_interval - INTEGER
The interval in milliseconds in which the next unsolicited
IGMPv1 or IGMPv2 report retransmit will take place.
@@ -1952,11 +2171,25 @@ igmpv3_unsolicited_report_interval - INTEGER
ignore_routes_with_linkdown - BOOLEAN
Ignore routes whose link is down when performing a FIB lookup.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
promote_secondaries - BOOLEAN
When a primary IP address is removed from this interface
promote a corresponding secondary IP address instead of
removing all the corresponding secondary IP addresses.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
drop_unicast_in_l2_multicast - BOOLEAN
Drop any unicast IP packets that are received in link-layer
multicast (or broadcast) frames.
@@ -1964,14 +2197,24 @@ drop_unicast_in_l2_multicast - BOOLEAN
This behavior (for multicast) is actually a SHOULD in RFC
1122, but is disabled by default for compatibility reasons.
- Default: off (0)
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
drop_gratuitous_arp - BOOLEAN
Drop all gratuitous ARP frames, for example if there's a known
good ARP proxy on the network and such frames need not be used
(or in the case of 802.11, must not be used to prevent attacks.)
- Default: off (0)
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
tag - INTEGER
@@ -2015,20 +2258,24 @@ bindv6only - BOOLEAN
which restricts use of the IPv6 socket to IPv6 communication
only.
- - TRUE: disable IPv4-mapped address feature
- - FALSE: enable IPv4-mapped address feature
+ Possible values:
+
+ - 0 (disabled) - enable IPv4-mapped address feature
+ - 1 (enabled) - disable IPv4-mapped address feature
- Default: FALSE (as specified in RFC3493)
+ Default: 0 (disabled)
flowlabel_consistency - BOOLEAN
Protect the consistency (and unicity) of flow label.
You have to disable it to use IPV6_FL_F_REFLECT flag on the
flow label manager.
- - TRUE: enabled
- - FALSE: disabled
+ Possible values:
- Default: TRUE
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
auto_flowlabels - INTEGER
Automatically generate flow labels based on a flow hash of the
@@ -2054,10 +2301,13 @@ flowlabel_state_ranges - BOOLEAN
reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF
is reserved for stateless flow labels as described in RFC6437.
- - TRUE: enabled
- - FALSE: disabled
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
- Default: true
flowlabel_reflect - INTEGER
Control flow label reflection. Needed for Path MTU
@@ -2125,10 +2375,13 @@ anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
- - TRUE: enabled
- - FALSE: disabled
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
- Default: FALSE
idgen_delay - INTEGER
Controls the delay in seconds after which time to retry
@@ -2185,7 +2438,12 @@ skip_notify_on_dev_down - BOOLEAN
to true skips the message, making IPv4 and IPv6 on par in relying
on userspace caches to track link events and evict routes.
- Default: false (generate message)
+ Possible values:
+
+ - 0 (disabled) - generate the message
+ - 1 (enabled) - skip generating the message
+
+ Default: 0 (disabled)
nexthop_compat_mode - BOOLEAN
New nexthop API provides a means for managing nexthops independent of
@@ -2229,8 +2487,10 @@ fib_notify_on_flag_change - INTEGER
ioam6_id - INTEGER
Define the IOAM id of this node. Uses only 24 bits out of 32 in total.
- Min: 0
- Max: 0xFFFFFF
+ Possible value range:
+
+ - Min: 0
+ - Max: 0xFFFFFF
Default: 0xFFFFFF
@@ -2238,8 +2498,10 @@ ioam6_id_wide - LONG INTEGER
Define the wide IOAM id of this node. Uses only 56 bits out of 64 in
total. Can be different from ioam6_id.
- Min: 0
- Max: 0xFFFFFFFFFFFFFF
+ Possible value range:
+
+ - Min: 0
+ - Max: 0xFFFFFFFFFFFFFF
Default: 0xFFFFFFFFFFFFFF
@@ -2281,8 +2543,8 @@ conf/all/disable_ipv6 - BOOLEAN
conf/all/forwarding - BOOLEAN
Enable global IPv6 forwarding between all interfaces.
- IPv4 and IPv6 work differently here; e.g. netfilter must be used
- to control which interfaces may forward packets and which not.
+ IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must
+ be used to control which interfaces may forward packets.
This also sets all interfaces' Host/Router setting
'forwarding' to the specified value. See below for details.
@@ -2292,13 +2554,30 @@ conf/all/forwarding - BOOLEAN
proxy_ndp - BOOLEAN
Do proxy ndp.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
+force_forwarding - BOOLEAN
+ Enable forwarding on this interface only -- regardless of the setting on
+ ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0,
+ the ``force_forwarding`` flag will be reset on all interfaces.
+
fwmark_reflect - BOOLEAN
Controls the fwmark of kernel-generated IPv6 reply packets that are not
associated with a socket for example, TCP RSTs or ICMPv6 echo replies).
- If unset, these packets have a fwmark of zero. If set, they have the
+ If disabled, these packets have a fwmark of zero. If enabled, they have the
fwmark of the packet they are replying to.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
``conf/interface/*``:
Change special settings per interface.
@@ -2389,9 +2668,11 @@ ra_honor_pio_life - BOOLEAN
lifetime of an address matching a prefix sent in a Router
Advertisement Prefix Information Option.
- - If enabled, the PIO valid lifetime will always be honored.
- - If disabled, RFC4862 section 5.5.3e is used to determine
+ Possible values:
+
+ - 0 (disabled) - RFC4862 section 5.5.3e is used to determine
the valid lifetime of the address.
+ - 1 (enabled) - the PIO valid lifetime will always be honored.
Default: 0 (disabled)
@@ -2403,8 +2684,10 @@ ra_honor_pio_pflag - BOOLEAN
P-flag suppresses any effects of the A-flag within the same
PIO. For a given PIO, P=1 and A=1 is treated as A=0.
- - If disabled, the P-flag is ignored.
- - If enabled, the P-flag will disable SLAAC autoconfiguration
+ Possible values:
+
+ - 0 (disabled) - the P-flag is ignored.
+ - 1 (enabled) - the P-flag will disable SLAAC autoconfiguration
for the given Prefix Information Option.
Default: 0 (disabled)
@@ -2526,10 +2809,15 @@ mtu - INTEGER
Default: 1280 (IPv6 required minimum)
ip_nonlocal_bind - BOOLEAN
- If set, allows processes to bind() to non-local IPv6 addresses,
+ If enabled, allows processes to bind() to non-local IPv6 addresses,
which can be quite useful - but may break some applications.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
router_probe_interval - INTEGER
Minimum interval (in seconds) between Router Probing described
@@ -2559,7 +2847,12 @@ use_oif_addrs_only - BOOLEAN
routed via this interface are restricted to the set of addresses
configured on this interface (vis. RFC 6724, section 4).
- Default: false
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
use_tempaddr - INTEGER
Preference for Privacy Extensions (RFC3041).
@@ -2684,10 +2977,14 @@ force_tllao - BOOLEAN
ndisc_notify - BOOLEAN
Define mode for notification of address and device changes.
- * 0 - (default): do nothing
- * 1 - Generate unsolicited neighbour advertisements when device is brought
+ Possible values:
+
+ - 0 (disabled) - do nothing
+ - 1 (enabled) - Generate unsolicited neighbour advertisements when device is brought
up or hardware address changes.
+ Default: 0 (disabled)
+
ndisc_tclass - INTEGER
The IPv6 Traffic Class to use by default when sending IPv6 Neighbor
Discovery (Router Solicitation, Router Advertisement, Neighbor
@@ -2704,8 +3001,12 @@ ndisc_evict_nocarrier - BOOLEAN
not be cleared when roaming between access points on the same network.
In most cases this should remain as the default (1).
- - 1 - (default): Clear neighbor discover cache on NOCARRIER events.
- - 0 - Do not clear neighbor discovery cache on NOCARRIER events.
+ Possible values:
+
+ - 0 (disabled) - Do not clear neighbor discovery cache on NOCARRIER events.
+ - 1 (enabled) - Clear neighbor discover cache on NOCARRIER events.
+
+ Default: 1 (enabled)
mldv1_unsolicited_report_interval - INTEGER
The interval in milliseconds in which the next unsolicited
@@ -2734,25 +3035,34 @@ suppress_frag_ndisc - INTEGER
optimistic_dad - BOOLEAN
Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
- * 0: disabled (default)
- * 1: enabled
-
Optimistic Duplicate Address Detection for the interface will be enabled
if at least one of conf/{all,interface}/optimistic_dad is set to 1,
it will be disabled otherwise.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
+
use_optimistic - BOOLEAN
If enabled, do not classify optimistic addresses as deprecated during
source address selection. Preferred addresses will still be chosen
before optimistic addresses, subject to other ranking in the source
address selection algorithm.
- * 0: disabled (default)
- * 1: enabled
-
This will be enabled if at least one of
conf/{all,interface}/use_optimistic is set to 1, disabled otherwise.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
+
stable_secret - IPv6 address
This IPv6 address will be used as a secret to generate IPv6
addresses for link-local addresses and autoconfigured
@@ -2783,14 +3093,24 @@ drop_unicast_in_l2_multicast - BOOLEAN
Drop any unicast IPv6 packets that are received in link-layer
multicast (or broadcast) frames.
- By default this is turned off.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
drop_unsolicited_na - BOOLEAN
Drop all unsolicited neighbor advertisements, for example if there's
a known good NA proxy on the network and such frames need not be used
(or in the case of 802.11, must not be used to prevent attacks.)
- By default this is turned off.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled).
accept_untracked_na - INTEGER
Define behavior for accepting neighbor advertisements from devices that
@@ -2831,7 +3151,12 @@ enhanced_dad - BOOLEAN
The nonce option will be sent on an interface unless both of
conf/{all,interface}/enhanced_dad are set to FALSE.
- Default: TRUE
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 1 (enabled)
``icmp/*``:
===========
@@ -2860,29 +3185,49 @@ ratemask - list of comma separated ranges
Default: 0-1,3-127 (rate limit ICMPv6 errors except Packet Too Big)
echo_ignore_all - BOOLEAN
- If set non-zero, then the kernel will ignore all ICMP ECHO
+ If enabled, then the kernel will ignore all ICMP ECHO
requests sent to it over the IPv6 protocol.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
echo_ignore_multicast - BOOLEAN
- If set non-zero, then the kernel will ignore all ICMP ECHO
+ If enabled, then the kernel will ignore all ICMP ECHO
requests sent to it over the IPv6 protocol via multicast.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
echo_ignore_anycast - BOOLEAN
- If set non-zero, then the kernel will ignore all ICMP ECHO
+ If enabled, then the kernel will ignore all ICMP ECHO
requests sent to it over the IPv6 protocol destined to anycast address.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
error_anycast_as_unicast - BOOLEAN
- If set to 1, then the kernel will respond with ICMP Errors
+ If enabled, then the kernel will respond with ICMP Errors
resulting from requests sent to it over the IPv6 protocol destined
to anycast address essentially treating anycast as unicast.
- Default: 0
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
+ Default: 0 (disabled)
xfrm6_gc_thresh - INTEGER
(Obsolete since linux-4.14)
@@ -2900,34 +3245,49 @@ YOSHIFUJI Hideaki / USAGI Project <yoshfuji@linux-ipv6.org>
=================================
bridge-nf-call-arptables - BOOLEAN
- - 1 : pass bridged ARP traffic to arptables' FORWARD chain.
- - 0 : disable this.
- Default: 1
+ Possible values:
+
+ - 0 (disabled) - disable this.
+ - 1 (enabled) - pass bridged ARP traffic to arptables' FORWARD chain.
+
+ Default: 1 (enabled)
bridge-nf-call-iptables - BOOLEAN
- - 1 : pass bridged IPv4 traffic to iptables' chains.
- - 0 : disable this.
- Default: 1
+ Possible values:
+
+ - 0 (disabled) - disable this.
+ - 1 (enabled) - pass bridged IPv4 traffic to iptables' chains.
+
+ Default: 1 (enabled)
bridge-nf-call-ip6tables - BOOLEAN
- - 1 : pass bridged IPv6 traffic to ip6tables' chains.
- - 0 : disable this.
- Default: 1
+ Possible values:
+
+ - 0 (disabled) - disable this.
+ - 1 (enabled) - pass bridged IPv6 traffic to ip6tables' chains.
+
+ Default: 1 (enabled)
bridge-nf-filter-vlan-tagged - BOOLEAN
- - 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables.
- - 0 : disable this.
- Default: 0
+ Possible values:
+
+ - 0 (disabled) - disable this.
+ - 1 (enabled) - pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables
+
+ Default: 0 (disabled)
bridge-nf-filter-pppoe-tagged - BOOLEAN
- - 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
- - 0 : disable this.
- Default: 0
+ Possible values:
+
+ - 0 (disabled) - disable this.
+ - 1 (enabled) - pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
+
+ Default: 0 (disabled)
bridge-nf-pass-vlan-input-dev - BOOLEAN
- 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan
@@ -2950,11 +3310,12 @@ addip_enable - BOOLEAN
the ability to dynamically add and remove new addresses for the SCTP
associations.
- 1: Enable extension.
+ Possible values:
- 0: Disable extension.
+ - 0 (disabled) - disable extension.
+ - 1 (enabled) - enable extension
- Default: 0
+ Default: 0 (disabled)
pf_enable - INTEGER
Enable or disable pf (pf is short for potentially failed) state. A value
@@ -2969,31 +3330,27 @@ pf_enable - INTEGER
https://datatracker.ietf.org/doc/draft-ietf-tsvwg-sctp-failover for
details.
- 1: Enable pf.
+ Possible values:
- 0: Disable pf.
+ - 1: Enable pf.
+ - 0: Disable pf.
Default: 1
pf_expose - INTEGER
Unset or enable/disable pf (pf is short for potentially failed) state
exposure. Applications can control the exposure of the PF path state
- in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO
- sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with
- SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info
- can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled,
- a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming
- SCTP_PF state and a SCTP_PF-state transport info can be got via
- SCTP_GET_PEER_ADDR_INFO sockopt; When it's disabled, no
- SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when
- trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO
- sockopt.
-
- 0: Unset pf state exposure, Compatible with old applications.
+ in the SCTP_PEER_ADDR_CHANGE event and access of SCTP_PF-state
+ transport info via SCTP_GET_PEER_ADDR_INFO sockopt.
- 1: Disable pf state exposure.
+ Possible values:
- 2: Enable pf state exposure.
+ - 0: Unset pf state exposure (compatible with old applications). No
+ event will be sent but the transport info can be queried.
+ - 1: Disable pf state exposure. No event will be sent and trying to
+ obtain transport info will return -EACCESS.
+ - 2: Enable pf state exposure. The event will be sent for a transport
+ becoming SCTP_PF state and transport info can be obtained.
Default: 0
@@ -3023,19 +3380,23 @@ auth_enable - BOOLEAN
required for secure operation of Dynamic Address Reconfiguration
(ADD-IP) extension.
- - 1: Enable this extension.
- - 0: Disable this extension.
+ Possible values:
- Default: 0
+ - 0 (disabled) - disable extension.
+ - 1 (enabled) - enable extension
+
+ Default: 0 (disabled)
prsctp_enable - BOOLEAN
Enable or disable the Partial Reliability extension (RFC3758) which
is used to notify peers that a given DATA should no longer be expected.
- - 1: Enable extension
- - 0: Disable
+ Possible values:
- Default: 1
+ - 0 (disabled) - disable extension.
+ - 1 (enabled) - enable extension
+
+ Default: 1 (enabled)
max_burst - INTEGER
The limit of the number of new packets that can be initially sent. It
@@ -3135,10 +3496,12 @@ cookie_preserve_enable - BOOLEAN
Enable or disable the ability to extend the lifetime of the SCTP cookie
that is used during the establishment phase of SCTP association
- - 1: Enable cookie lifetime extension.
- - 0: Disable
+ Possible values:
+
+ - 0 (disabled) - disable.
+ - 1 (enabled) - enable cookie lifetime extension.
- Default: 1
+ Default: 1 (enabled)
cookie_hmac_alg - STRING
Select the hmac algorithm used when generating the cookie value sent by
@@ -3183,13 +3546,11 @@ sndbuf_policy - INTEGER
sctp_mem - vector of 3 INTEGERs: min, pressure, max
Number of pages allowed for queueing by all SCTP sockets.
- min: Below this number of pages SCTP is not bothered about its
- memory appetite. When amount of memory allocated by SCTP exceeds
- this number, SCTP starts to moderate memory usage.
-
- pressure: This value was introduced to follow format of tcp_mem.
-
- max: Number of pages allowed for queueing by all SCTP sockets.
+ * min: Below this number of pages SCTP is not bothered about its
+ memory usage. When amount of memory allocated by SCTP exceeds
+ this number, SCTP starts to moderate memory usage.
+ * pressure: This value was introduced to follow format of tcp_mem.
+ * max: Maximum number of allowed pages.
Default is calculated at boot time from amount of available memory.
@@ -3197,9 +3558,9 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max
Only the first value ("min") is used, "default" and "max" are
ignored.
- min: Minimal size of receive buffer used by SCTP socket.
- It is guaranteed to each SCTP socket (but not association) even
- under moderate memory pressure.
+ * min: Minimal size of receive buffer used by SCTP socket.
+ It is guaranteed to each SCTP socket (but not association) even
+ under moderate memory pressure.
Default: 4K
@@ -3207,14 +3568,16 @@ sctp_wmem - vector of 3 INTEGERs: min, default, max
Only the first value ("min") is used, "default" and "max" are
ignored.
- min: Minimum size of send buffer that can be used by SCTP sockets.
- It is guaranteed to each SCTP socket (but not association) even
- under moderate memory pressure.
+ * min: Minimum size of send buffer that can be used by SCTP sockets.
+ It is guaranteed to each SCTP socket (but not association) even
+ under moderate memory pressure.
Default: 4K
addr_scope_policy - INTEGER
- Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00
+ Control IPv4 address scoping (see
+ https://datatracker.ietf.org/doc/draft-stewart-tsvwg-sctp-ipv4/00/
+ for details).
- 0 - Disable IPv4 address scoping
- 1 - Enable IPv4 address scoping
@@ -3272,10 +3635,12 @@ reconf_enable - BOOLEAN
a stream, and it includes the Parameters of "Outgoing/Incoming SSN
Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams".
- - 1: Enable extension.
- - 0: Disable extension.
+ Possible values:
- Default: 0
+ - 0 (disabled) - Disable extension.
+ - 1 (enabled) - Enable extension.
+
+ Default: 0 (disabled)
intl_enable - BOOLEAN
Enable or disable extension of User Message Interleaving functionality
@@ -3286,10 +3651,12 @@ intl_enable - BOOLEAN
to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2
and SCTP_INTERLEAVING_SUPPORTED to 1.
- - 1: Enable extension.
- - 0: Disable extension.
+ Possible values:
- Default: 0
+ - 0 (disabled) - Disable extension.
+ - 1 (enabled) - Enable extension.
+
+ Default: 0 (disabled)
ecn_enable - BOOLEAN
Control use of Explicit Congestion Notification (ECN) by SCTP.
@@ -3298,10 +3665,12 @@ ecn_enable - BOOLEAN
due to congestion by allowing supporting routers to signal congestion
before having to drop packets.
- 1: Enable ecn.
- 0: Disable ecn.
+ Possible values:
- Default: 1
+ - 0 (disabled) - Disable ecn.
+ - 1 (enabled) - Enable ecn.
+
+ Default: 1 (enabled)
l3mdev_accept - BOOLEAN
Enabling this option allows a "global" bound socket to work
@@ -3310,6 +3679,11 @@ l3mdev_accept - BOOLEAN
originated. Only valid when the kernel was compiled with
CONFIG_NET_L3_MASTER_DEV.
+ Possible values:
+
+ - 0 (disabled)
+ - 1 (enabled)
+
Default: 1 (enabled)
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index d0e3953cae6a..a15754adb041 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -444,7 +444,14 @@ dependent). The NAPI instance IDs will be assigned in the opposite
order than the process IDs of the kernel threads.
Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
-netdev's sysfs directory.
+netdev's sysfs directory. It can also be enabled for a specific NAPI using
+netlink interface.
+
+For example, using the script:
+
+.. code-block:: bash
+
+ $ ynl --family netdev --do napi-set --json='{"id": 66, "threaded": 1}'
.. rubric:: Footnotes
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 6327e689e8a8..1c19bb7705df 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -10,6 +10,7 @@ Type Name fastpath_tx_acce
=================================== =========================== =================== =================== ===================================================================================
unsigned_long:32 priv_flags read_mostly __dev_queue_xmit(tx)
unsigned_long:1 lltx read_mostly HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx)
+unsigned long:1 netmem_tx:1; read_mostly
char name[16]
struct netdev_name_node* name_node
struct dev_ifalias* ifalias
@@ -67,6 +68,7 @@ unsigned_char addr_assign_type
unsigned_char addr_len
unsigned_char upper_level
unsigned_char lower_level
+u8 threaded napi_poll(napi_enable,netif_set_threaded)
unsigned_short neigh_priv_len
unsigned_short padded
unsigned_short dev_id
@@ -131,7 +133,7 @@ struct ref_tracker_dir refcnt_tracker
struct list_head link_watch_list
enum:8 reg_state
bool dismantle
-enum:16 rtnl_link_state
+bool rtnl_link_initilizing
bool needs_free_netdev
void*priv_destructor struct net_device
struct netpoll_info* npinfo read_mostly napi_poll/napi_poll_lock
@@ -164,7 +166,6 @@ struct sfp_bus* sfp_bus
struct lock_class_key* qdisc_tx_busylock
bool proto_down
unsigned:1 wol_enabled
-unsigned:1 threaded napi_poll(napi_enable,dev_set_threaded)
unsigned_long:1 see_all_hwtstamp_requests
unsigned_long:1 change_proto_down
unsigned_long:1 netns_immutable
diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
index bc96efc92cf5..bce4eb35ec48 100644
--- a/Documentation/networking/net_cachelines/snmp.rst
+++ b/Documentation/networking/net_cachelines/snmp.rst
@@ -36,7 +36,10 @@ unsigned_long LINUX_MIB_TIMEWAITRECYCLED
unsigned_long LINUX_MIB_TIMEWAITKILLED
unsigned_long LINUX_MIB_PAWSACTIVEREJECTED
unsigned_long LINUX_MIB_PAWSESTABREJECTED
+unsigned_long LINUX_MIB_BEYOND_WINDOW
unsigned_long LINUX_MIB_TSECR_REJECTED
+unsigned_long LINUX_MIB_PAWS_OLD_ACK
+unsigned_long LINUX_MIB_PAWS_TW_REJECTED
unsigned_long LINUX_MIB_DELAYEDACKLOST
unsigned_long LINUX_MIB_LISTENOVERFLOWS
unsigned_long LINUX_MIB_LISTENDROPS
diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
index bc9b2131bf7a..7bbda5944ee2 100644
--- a/Documentation/networking/net_cachelines/tcp_sock.rst
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -115,7 +115,6 @@ u32 lost_out read_mostly read_m
u32 sacked_out read_mostly read_mostly tcp_left_out(tx);tcp_packets_in_flight(tx/rx);tcp_clean_rtx_queue(rx)
struct hrtimer pacing_timer
struct hrtimer compressed_ack_timer
-struct sk_buff* lost_skb_hint read_mostly tcp_clean_rtx_queue
struct sk_buff* retransmit_skb_hint read_mostly tcp_clean_rtx_queue
struct rb_root out_of_order_queue read_mostly tcp_data_queue,tcp_fast_path_check
struct sk_buff* ooo_last_skb
@@ -123,7 +122,6 @@ struct tcp_sack_block[1] duplicate_sack
struct tcp_sack_block[4] selective_acks
struct tcp_sack_block[4] recv_sack_cache
struct sk_buff* highest_sack read_write tcp_event_new_data_sent
-int lost_cnt_hint
u32 prior_ssthresh
u32 high_seq
u32 retrans_stamp
diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst
index a0076b542e9c..59cb9982afe6 100644
--- a/Documentation/networking/netconsole.rst
+++ b/Documentation/networking/netconsole.rst
@@ -340,6 +340,38 @@ In this example, the message was sent by CPU 42.
cpu=42 # kernel-populated value
+Message ID auto population in userdata
+--------------------------------------
+
+Within the netconsole configfs hierarchy, there is a file named `msgid_enabled`
+located in the `userdata` directory. This file controls the message ID
+auto-population feature, which assigns a numeric id to each message sent to a
+given target and appends the ID to userdata dictionary in every message sent.
+
+The message ID is generated using a per-target 32 bit counter that is
+incremented for every message sent to the target. Note that this counter will
+eventually wrap around after reaching uint32_t max value, so the message ID is
+not globally unique over time. However, it can still be used by the target to
+detect if messages were dropped before reaching the target by identifying gaps
+in the sequence of IDs.
+
+It is important to distinguish message IDs from the message <sequnum> field.
+Some kernel messages may never reach netconsole (for example, due to printk
+rate limiting). Thus, a gap in <sequnum> cannot be solely relied upon to
+indicate that a message was dropped during transmission, as it may never have
+been sent via netconsole. The message ID, on the other hand, is only assigned
+to messages that are actually transmitted via netconsole.
+
+Example::
+
+ echo "This is message #1" > /dev/kmsg
+ echo "This is message #2" > /dev/kmsg
+ 13,434,54928466,-;This is message #1
+ msgid=1
+ 13,435,54934019,-;This is message #2
+ msgid=2
+
+
Extended console:
=================
diff --git a/Documentation/networking/netdev-features.rst b/Documentation/networking/netdev-features.rst
index 5014f7cc1398..02bd7536fc0c 100644
--- a/Documentation/networking/netdev-features.rst
+++ b/Documentation/networking/netdev-features.rst
@@ -188,3 +188,8 @@ Redundancy) frames from one port to another in hardware.
This should be set for devices which duplicate outgoing HSR (High-availability
Seamless Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically
frames in hardware.
+
+* netmem-tx
+
+This should be set for devices which support netmem TX. See
+Documentation/networking/netmem.rst
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index ebb868f50ac2..7ebb6c36482d 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -8,7 +8,7 @@ Network Devices, the Kernel, and You!
Introduction
============
The following is a random collection of documentation regarding
-network devices.
+network devices. It is intended for driver developers.
struct net_device lifetime rules
================================
@@ -314,13 +314,8 @@ napi->poll:
softirq
will be called with interrupts disabled by netconsole.
-struct netdev_queue_mgmt_ops synchronization rules
-==================================================
-
-All queue management ndo callbacks are holding netdev instance lock.
-
-RTNL and netdev instance lock
-=============================
+netdev instance lock
+====================
Historically, all networking control operations were protected by a single
global lock known as ``rtnl_lock``. There is an ongoing effort to replace this
@@ -328,20 +323,89 @@ global lock with separate locks for each network namespace. Additionally,
properties of individual netdev are increasingly protected by per-netdev locks.
For device drivers that implement shaping or queue management APIs, all control
-operations will be performed under the netdev instance lock. Currently, this
-instance lock is acquired within the context of ``rtnl_lock``. The drivers
-can also explicitly request instance lock to be acquired via
-``request_ops_lock``. In the future, there will be an option for individual
+operations will be performed under the netdev instance lock.
+Drivers can also explicitly request instance lock to be held during ops
+by setting ``request_ops_lock`` to true. Code comments and docs refer
+to drivers which have ops called under the instance lock as "ops locked".
+See also the documentation of the ``lock`` member of struct net_device.
+
+In the future, there will be an option for individual
drivers to opt out of using ``rtnl_lock`` and instead perform their control
operations directly under the netdev instance lock.
Devices drivers are encouraged to rely on the instance lock where possible.
For the (mostly software) drivers that need to interact with the core stack,
-there are two sets of interfaces: ``dev_xxx`` and ``netif_xxx`` (e.g.,
-``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx`` functions handle
-acquiring the instance lock themselves, while the ``netif_xxx`` functions
-assume that the driver has already acquired the instance lock.
+there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx``
+(e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx``
+functions handle acquiring the instance lock themselves, while the
+``netif_xxx`` functions assume that the driver has already acquired
+the instance lock.
+
+struct net_device_ops
+---------------------
+
+``ndos`` are called without holding the instance lock for most drivers.
+
+"Ops locked" drivers will have most of the ``ndos`` invoked under
+the instance lock.
+
+struct ethtool_ops
+------------------
+
+Similarly to ``ndos`` the instance lock is only held for select drivers.
+For "ops locked" drivers all ethtool ops without exceptions should
+be called under the instance lock.
+
+struct netdev_stat_ops
+----------------------
+
+"qstat" ops are invoked under the instance lock for "ops locked" drivers,
+and under rtnl_lock for all other drivers.
+
+struct net_shaper_ops
+---------------------
+
+All net shaper callbacks are invoked while holding the netdev instance
+lock. ``rtnl_lock`` may or may not be held.
+
+Note that supporting net shapers automatically enables "ops locking".
+
+struct netdev_queue_mgmt_ops
+----------------------------
+
+All queue management callbacks are invoked while holding the netdev instance
+lock. ``rtnl_lock`` may or may not be held.
+
+Note that supporting struct netdev_queue_mgmt_ops automatically enables
+"ops locking".
+
+Notifiers and netdev instance lock
+----------------------------------
+
+For device drivers that implement shaping or queue management APIs,
+some of the notifiers (``enum netdev_cmd``) are running under the netdev
+instance lock.
+
+The following netdev notifiers are always run under the instance lock:
+* ``NETDEV_XDP_FEAT_CHANGE``
+
+For devices with locked ops, currently only the following notifiers are
+running under the lock:
+* ``NETDEV_CHANGE``
+* ``NETDEV_REGISTER``
+* ``NETDEV_UP``
+
+The following notifiers are running without the lock:
+* ``NETDEV_UNREGISTER``
+
+There are no clear expectations for the remaining notifiers. Notifiers not on
+the list may run with or without the instance lock, potentially even invoking
+the same notifier type with and without the lock from different code paths.
+The goal is to eventually ensure that all (or most, with a few documented
+exceptions) notifiers run under the instance lock. Please extend this
+documentation whenever you make explicit assumption about lock being held
+from a notifier.
NETDEV_INTERNAL symbol namespace
================================
diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst
index 7de21ddb5412..b63aded46337 100644
--- a/Documentation/networking/netmem.rst
+++ b/Documentation/networking/netmem.rst
@@ -19,8 +19,8 @@ Benefits of Netmem :
* Simplified Development: Drivers interact with a consistent API,
regardless of the underlying memory implementation.
-Driver Requirements
-===================
+Driver RX Requirements
+======================
1. The driver must support page_pool.
@@ -77,3 +77,22 @@ Driver Requirements
that purpose, but be mindful that some netmem types might have longer
circulation times, such as when userspace holds a reference in zerocopy
scenarios.
+
+Driver TX Requirements
+======================
+
+1. The Driver must not pass the netmem dma_addr to any of the dma-mapping APIs
+ directly. This is because netmem dma_addrs may come from a source like
+ dma-buf that is not compatible with the dma-mapping APIs.
+
+ Helpers like netmem_dma_unmap_page_attrs() & netmem_dma_unmap_addr_set()
+ should be used in lieu of dma_unmap_page[_attrs](), dma_unmap_addr_set().
+ The netmem variants will handle netmem dma_addrs correctly regardless of the
+ source, delegating to the dma-mapping APIs when appropriate.
+
+ Not all dma-mapping APIs have netmem equivalents at the moment. If your
+ driver relies on a missing netmem API, feel free to add and propose to
+ netdev@, or reach out to the maintainers and/or almasrymina@google.com for
+ help adding the netmem API.
+
+2. Driver should declare support by setting `netdev->netmem_tx = true`
diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst
index 238b66d0e059..35f889259fcd 100644
--- a/Documentation/networking/nf_conntrack-sysctl.rst
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -85,7 +85,6 @@ nf_conntrack_log_invalid - INTEGER
- 1 - log ICMP packets
- 6 - log TCP packets
- 17 - log UDP packets
- - 33 - log DCCP packets
- 41 - log ICMPv6 packets
- 136 - log UDPLITE packets
- 255 - log packets of any protocol
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index f64641417c54..7f159043ad5a 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -333,6 +333,13 @@ Some of the interface modes are described below:
SerDes lane, each port having speeds of 2.5G / 1G / 100M / 10M achieved
through symbol replication. The PCS expects the standard USXGMII code word.
+``PHY_INTERFACE_MODE_MIILITE``
+ Non-standard, simplified MII mode, without TXER, RXER, CRS and COL signals
+ as defined for the MII. The absence of COL signal makes half-duplex link
+ modes impossible but does not interfere with BroadR-Reach link modes on
+ Broadcom (and other two-wire Ethernet) PHYs, because they are full-duplex
+ only.
+
Pause frames / flow control
===========================
diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst
index 498395f5fbcb..41b0a6182fe4 100644
--- a/Documentation/networking/rds.rst
+++ b/Documentation/networking/rds.rst
@@ -265,7 +265,7 @@ RDS Protocol
The bitmaps are allocated as connections are brought up. This
avoids allocation in the interrupt handling path which queues
- sages on sockets. The dense bitmaps let transports send the
+ messages on sockets. The dense bitmaps let transports send the
entire bitmap on any bitmap change reasonably efficiently. This
is much easier to implement than some finer-grained
communication of per-port congestion. The sender does a very
@@ -373,7 +373,7 @@ The recv path
- validate header checksum
- copy header to rds_ib_incoming struct if start of a new datagram
- add to ibinc's fraglist
- - if competed datagram:
+ - if completed datagram:
- update cong map if datagram was cong update
- call rds_recv_incoming() otherwise
- note if ack is required
@@ -415,7 +415,7 @@ Multipath RDS (mprds)
I/O workqs and reconnect threads are driven from the rds_conn_path.
Transports such as TCP that are multipath capable may then set up a
TCP socket per rds_conn_path, and this is managed by the transport via
- the transport privatee cp_transport_data pointer.
+ the transport private cp_transport_data pointer.
Transports announce themselves as multipath capable by setting the
t_mp_capable bit during registration with the rds core module. When the
@@ -430,7 +430,7 @@ Multipath RDS (mprds)
This is done by sending out a control packet exchange before the
first data packet. The control packet exchange must have completed
prior to outgoing hash completion in rds_sendmsg() when the transport
- is mutlipath capable.
+ is multipath capable.
The control packet is an RDS ping packet (i.e., packet to rds dest
port 0) with the ping packet having a rds extension header option of
diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst
index e807e18ba32a..d63e3e27dd06 100644
--- a/Documentation/networking/rxrpc.rst
+++ b/Documentation/networking/rxrpc.rst
@@ -1062,30 +1062,6 @@ The kernel interface functions are as follows:
first function to change. Note that this must be called in TASK_RUNNING
state.
- (#) Get remote client epoch::
-
- u32 rxrpc_kernel_get_epoch(struct socket *sock,
- struct rxrpc_call *call)
-
- This allows the epoch that's contained in packets of an incoming client
- call to be queried. This value is returned. The function always
- successful if the call is still in progress. It shouldn't be called once
- the call has expired. Note that calling this on a local client call only
- returns the local epoch.
-
- This value can be used to determine if the remote client has been
- restarted as it shouldn't change otherwise.
-
- (#) Set the maximum lifespan on a call::
-
- void rxrpc_kernel_set_max_life(struct socket *sock,
- struct rxrpc_call *call,
- unsigned long hard_timeout)
-
- This sets the maximum lifespan on a call to hard_timeout (which is in
- jiffies). In the event of the timeout occurring, the call will be
- aborted and -ETIME or -ETIMEDOUT will be returned.
-
(#) Apply the RXRPC_MIN_SECURITY_LEVEL sockopt to a socket from within in the
kernel::
@@ -1172,3 +1148,18 @@ adjusted through sysctls in /proc/net/rxrpc/:
header plus exactly 1412 bytes of data. The terminal packet must contain
a four byte header plus any amount of data. In any event, a jumbo packet
may not exceed rxrpc_rx_mtu in size.
+
+
+API Function Reference
+======================
+
+.. kernel-doc:: net/rxrpc/af_rxrpc.c
+.. kernel-doc:: net/rxrpc/call_object.c
+.. kernel-doc:: net/rxrpc/key.c
+.. kernel-doc:: net/rxrpc/oob.c
+.. kernel-doc:: net/rxrpc/peer_object.c
+.. kernel-doc:: net/rxrpc/recvmsg.c
+.. kernel-doc:: net/rxrpc/rxgk.c
+.. kernel-doc:: net/rxrpc/rxkad.c
+.. kernel-doc:: net/rxrpc/sendmsg.c
+.. kernel-doc:: net/rxrpc/server_key.c
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index b8fef8101176..7aabead90648 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -811,11 +811,9 @@ Documentation/devicetree/bindings/ptp/timestamper.txt for more details.
3.2.4 Other caveats for MAC drivers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Stacked PHCs, especially DSA (but not only) - since that doesn't require any
-modification to MAC drivers, so it is more difficult to ensure correctness of
-all possible code paths - is that they uncover bugs which were impossible to
-trigger before the existence of stacked PTP clocks. One example has to do with
-this line of code, already presented earlier::
+The use of stacked PHCs may uncover MAC driver bugs which were impossible to
+trigger without them. One example has to do with this line of code, already
+presented earlier::
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst
index c7904a1bc167..36cc7afc2527 100644
--- a/Documentation/networking/tls.rst
+++ b/Documentation/networking/tls.rst
@@ -16,11 +16,13 @@ User interface
Creating a TLS connection
-------------------------
-First create a new TCP socket and set the TLS ULP.
+First create a new TCP socket and once the connection is established set the
+TLS ULP.
.. code-block:: c
sock = socket(AF_INET, SOCK_STREAM, 0);
+ connect(sock, addr, addrlen);
setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
Setting the TLS ULP allows us to set/get TLS socket options. Currently
diff --git a/Documentation/networking/tproxy.rst b/Documentation/networking/tproxy.rst
index 7f7c1ff6f159..75e4990cc3db 100644
--- a/Documentation/networking/tproxy.rst
+++ b/Documentation/networking/tproxy.rst
@@ -69,9 +69,9 @@ add rules like this to the iptables ruleset above::
# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
--tproxy-mark 0x1/0x1 --on-port 50080
-Or the following rule to nft:
+Or the following rule to nft::
-# nft add rule filter divert tcp dport 80 tproxy to :50080 meta mark set 1 accept
+ # nft add rule filter divert tcp dport 80 tproxy to :50080 meta mark set 1 accept
Note that for this to work you'll have to modify the proxy to enable (SOL_IP,
IP_TRANSPARENT) for the listening socket.
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
index a6e0ece18be5..ce96f4c99505 100644
--- a/Documentation/networking/xdp-rx-metadata.rst
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -120,6 +120,39 @@ It is possible to query which kfunc the particular netdev implements via
netlink. See ``xdp-rx-metadata-features`` attribute set in
``Documentation/netlink/specs/netdev.yaml``.
+Driver Implementation
+=====================
+
+Certain devices may prepend metadata to received packets. However, as of now,
+``AF_XDP`` lacks the ability to communicate the size of the ``data_meta`` area
+to the consumer. Therefore, it is the responsibility of the driver to copy any
+device-reserved metadata out from the metadata area and ensure that
+``xdp_buff->data_meta`` is pointing to ``xdp_buff->data`` before presenting the
+frame to the XDP program. This is necessary so that, after the XDP program
+adjusts the metadata area, the consumer can reliably retrieve the metadata
+address using ``METADATA_SIZE`` offset.
+
+The following diagram shows how custom metadata is positioned relative to the
+packet data and how pointers are adjusted for metadata access::
+
+ |<-- bpf_xdp_adjust_meta(xdp_buff, -METADATA_SIZE) --|
+ new xdp_buff->data_meta old xdp_buff->data_meta
+ | |
+ | xdp_buff->data
+ | |
+ +----------+----------------------------------------------------+------+
+ | headroom | custom metadata | data |
+ +----------+----------------------------------------------------+------+
+ | |
+ | xdp_desc->addr
+ |<------ xsk_umem__get_data() - METADATA_SIZE -------|
+
+``bpf_xdp_adjust_meta`` ensures that ``METADATA_SIZE`` is aligned to 4 bytes,
+does not exceed 252 bytes, and leaves sufficient space for building the
+xdp_frame. If these conditions are not met, it returns a negative error. In this
+case, the BPF program should not proceed to populate data into the ``data_meta``
+area.
+
Example
=======
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
index 7f24c09f2694..122204da0fff 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm_device.rst
@@ -65,9 +65,13 @@ Callbacks to implement
/* from include/linux/netdevice.h */
struct xfrmdev_ops {
/* Crypto and Packet offload callbacks */
- int (*xdo_dev_state_add) (struct xfrm_state *x, struct netlink_ext_ack *extack);
- void (*xdo_dev_state_delete) (struct xfrm_state *x);
- void (*xdo_dev_state_free) (struct xfrm_state *x);
+ int (*xdo_dev_state_add)(struct net_device *dev,
+ struct xfrm_state *x,
+ struct netlink_ext_ack *extack);
+ void (*xdo_dev_state_delete)(struct net_device *dev,
+ struct xfrm_state *x);
+ void (*xdo_dev_state_free)(struct net_device *dev,
+ struct xfrm_state *x);
bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
struct xfrm_state *x);
void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);