Diffstat (limited to 'Documentation/networking')
-rw-r--r-- Documentation/networking/af_xdp.rst | 33
-rw-r--r-- Documentation/networking/bareudp.rst | 11
-rw-r--r-- Documentation/networking/batman-adv.rst | 2
-rw-r--r-- Documentation/networking/bonding.rst | 19
-rw-r--r-- Documentation/networking/can.rst | 4
-rw-r--r-- Documentation/networking/cdc_mbim.rst | 2
-rw-r--r-- Documentation/networking/device_drivers/ethernet/amazon/ena.rst | 5
-rw-r--r-- Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst | 2
-rw-r--r-- Documentation/networking/device_drivers/ethernet/index.rst | 1
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/i40e.rst | 12
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/ice.rst | 31
-rw-r--r-- Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst | 91
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst | 51
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst | 3
-rw-r--r-- Documentation/networking/device_drivers/ethernet/meta/fbnic.rst | 72
-rw-r--r-- Documentation/networking/device_drivers/wwan/t7xx.rst | 64
-rw-r--r-- Documentation/networking/devlink/devlink-info.rst | 5
-rw-r--r-- Documentation/networking/devlink/devlink-port.rst | 33
-rw-r--r-- Documentation/networking/devlink/devlink-region.rst | 2
-rw-r--r-- Documentation/networking/devlink/hns3.rst | 5
-rw-r--r-- Documentation/networking/devlink/ice.rst | 72
-rw-r--r-- Documentation/networking/devlink/mlx5.rst | 3
-rw-r--r-- Documentation/networking/devlink/nfp.rst | 5
-rw-r--r-- Documentation/networking/devlink/octeontx2.rst | 37
-rw-r--r-- Documentation/networking/devmem.rst | 278
-rw-r--r-- Documentation/networking/diagnostic/index.rst | 17
-rw-r--r-- Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst | 784
-rw-r--r-- Documentation/networking/dns_resolver.rst | 4
-rw-r--r-- Documentation/networking/ethtool-netlink.rst | 386
-rw-r--r-- Documentation/networking/filter.rst | 4
-rw-r--r-- Documentation/networking/ieee802154.rst | 16
-rw-r--r-- Documentation/networking/index.rst | 9
-rw-r--r-- Documentation/networking/ip-sysctl.rst | 61
-rw-r--r-- Documentation/networking/iso15765-2.rst | 386
-rw-r--r-- Documentation/networking/j1939.rst | 2
-rw-r--r-- Documentation/networking/kapi.rst | 3
-rw-r--r-- Documentation/networking/l2tp.rst | 54
-rw-r--r-- Documentation/networking/mptcp-sysctl.rst | 91
-rw-r--r-- Documentation/networking/mptcp.rst | 156
-rw-r--r-- Documentation/networking/multi-pf-netdev.rst | 14
-rw-r--r-- Documentation/networking/napi.rst | 175
-rw-r--r-- Documentation/networking/net_cachelines/inet_connection_sock.rst | 86
-rw-r--r-- Documentation/networking/net_cachelines/inet_sock.rst | 74
-rw-r--r-- Documentation/networking/net_cachelines/net_device.rst | 356
-rw-r--r-- Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst | 301
-rw-r--r-- Documentation/networking/net_cachelines/snmp.rst | 256
-rw-r--r-- Documentation/networking/net_cachelines/tcp_sock.rst | 250
-rw-r--r-- Documentation/networking/net_dim.rst | 44
-rw-r--r-- Documentation/networking/netconsole.rst | 5
-rw-r--r-- Documentation/networking/netdev-features.rst | 15
-rw-r--r-- Documentation/networking/netdevices.rst | 14
-rw-r--r-- Documentation/networking/netlink_spec/readme.txt | 2
-rw-r--r-- Documentation/networking/netmem.rst | 79
-rw-r--r-- Documentation/networking/nf_conntrack-sysctl.rst | 4
-rw-r--r-- Documentation/networking/oa-tc6-framework.rst | 497
-rw-r--r-- Documentation/networking/packet_mmap.rst | 5
-rw-r--r-- Documentation/networking/phy-link-topology.rst | 121
-rw-r--r-- Documentation/networking/phy.rst | 6
-rw-r--r-- Documentation/networking/pse-pd/index.rst | 10
-rw-r--r-- Documentation/networking/pse-pd/introduction.rst | 73
-rw-r--r-- Documentation/networking/pse-pd/pse-pi.rst | 301
-rw-r--r-- Documentation/networking/sriov.rst | 25
-rw-r--r-- Documentation/networking/switchdev.rst | 4
-rw-r--r-- Documentation/networking/tcp_ao.rst | 29
-rw-r--r-- Documentation/networking/timestamping.rst | 72
-rw-r--r-- Documentation/networking/tipc.rst | 2
-rw-r--r-- Documentation/networking/tls-offload.rst | 29
-rw-r--r-- Documentation/networking/tls.rst | 36
-rw-r--r-- Documentation/networking/tproxy.rst | 2
-rw-r--r-- Documentation/networking/xfrm_device.rst | 3
-rw-r--r-- Documentation/networking/xfrm_proc.rst | 6
-rw-r--r-- Documentation/networking/xsk-tx-metadata.rst | 16
72 files changed, 4838 insertions, 890 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 72da7057e4cf..dceeb0d763aa 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -329,24 +329,23 @@ XDP_SHARED_UMEM option and provide the initial socket's fd in the
sxdp_shared_umem_fd field as you registered the UMEM on that
socket. These two sockets will now share one and the same UMEM.
-In this case, it is possible to use the NIC's packet steering
-capabilities to steer the packets to the right queue. This is not
-possible in the previous example as there is only one queue shared
-among sockets, so the NIC cannot do this steering as it can only steer
-between queues.
-
-In libxdp (or libbpf prior to version 1.0), you need to use the
-xsk_socket__create_shared() API as it takes a reference to a FILL ring
-and a COMPLETION ring that will be created for you and bound to the
-shared UMEM. You can use this function for all the sockets you create,
-or you can use it for the second and following ones and use
-xsk_socket__create() for the first one. Both methods yield the same
-result.
+There is no need to supply an XDP program like the one in the previous
+case where sockets were bound to the same queue id and
+device. Instead, use the NIC's packet steering capabilities to steer
+the packets to the right queue. In the previous example, there is only
+one queue shared among sockets, so the NIC cannot do this steering. It
+can only steer between queues.
+
+In libbpf, you need to use the xsk_socket__create_shared() API as it
+takes a reference to a FILL ring and a COMPLETION ring that will be
+created for you and bound to the shared UMEM. You can use this
+function for all the sockets you create, or you can use it for the
+second and following ones and use xsk_socket__create() for the first
+one. Both methods yield the same result.
Note that a UMEM can be shared between sockets on the same queue id
and device, as well as between queues on the same device and between
-devices at the same time. It is also possible to redirect to any
-socket as long as it is bound to the same umem with XDP_SHARED_UMEM.
+devices at the same time.
XDP_USE_NEED_WAKEUP bind flag
-----------------------------
@@ -823,10 +822,6 @@ A: The short answer is no, that is not supported at the moment. The
switch, or other distribution mechanism, in your NIC to direct
traffic to the correct queue id and socket.
- Note that if you are using the XDP_SHARED_UMEM option, it is
- possible to switch traffic between any socket bound to the same
- umem.
-
Q: My packets are sometimes corrupted. What is wrong?
A: Care has to be taken not to feed the same buffer in the UMEM into
diff --git a/Documentation/networking/bareudp.rst b/Documentation/networking/bareudp.rst
index b9d04ee6dac1..621cb9575c8f 100644
--- a/Documentation/networking/bareudp.rst
+++ b/Documentation/networking/bareudp.rst
@@ -6,16 +6,17 @@ Bare UDP Tunnelling Module Documentation
There are various L3 encapsulation standards using UDP being discussed to
leverage the UDP based load balancing capability of different networks.
-MPLSoUDP (__ https://tools.ietf.org/html/rfc7510) is one among them.
+MPLSoUDP (https://tools.ietf.org/html/rfc7510) is one among them.
The Bareudp tunnel module provides a generic L3 encapsulation support for
tunnelling different L3 protocols like MPLS, IP, NSH etc. inside a UDP tunnel.
Special Handling
----------------
+
The bareudp device supports special handling for MPLS & IP as they can have
multiple ethertypes.
-MPLS procotcol can have ethertypes ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast).
+The MPLS protocol can have ethertypes ETH_P_MPLS_UC (unicast) & ETH_P_MPLS_MC (multicast).
IP protocol can have ethertypes ETH_P_IP (v4) & ETH_P_IPV6 (v6).
This special handling can be enabled only for ethertypes ETH_P_IP & ETH_P_MPLS_UC
with a flag called multiproto mode.
@@ -52,7 +53,7 @@ be enabled explicitly with the "multiproto" flag.
3) Device Usage
The bareudp device could be used along with OVS or flower filter in TC.
-The OVS or TC flower layer must set the tunnel information in SKB dst field before
-sending packet buffer to the bareudp device for transmission. On reception the
-bareudp device extracts and stores the tunnel information in SKB dst field before
+The OVS or TC flower layer must set the tunnel information in the SKB dst field before
+sending the packet buffer to the bareudp device for transmission. On reception, the
+bareudp device extracts and stores the tunnel information in the SKB dst field before
passing the packet buffer to the network stack.
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst
index 8a0dcb1894b4..44b9b5cc0e24 100644
--- a/Documentation/networking/batman-adv.rst
+++ b/Documentation/networking/batman-adv.rst
@@ -164,5 +164,5 @@ Mailing-list:
You can also contact the Authors:
-* Marek Lindner <mareklindner@neomailbox.ch>
+* Marek Lindner <marek.lindner@mailbox.org>
* Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index e774b48de9f5..a4c1291d2561 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -1963,7 +1963,7 @@ obtain its hardware address from the first slave, which might not
match the hardware address of the VLAN interfaces (which was
ultimately copied from an earlier slave).
-There are two methods to insure that the VLAN device operates
+There are two methods to ensure that the VLAN device operates
with the correct hardware address if all slaves are removed from a
bond interface:
@@ -2078,7 +2078,7 @@ as an unsolicited ARP reply (because ARP matches replies on an
interface basis), and is discarded. The MII monitor is not affected
by the state of the routing table.
-The solution here is simply to insure that slaves do not have
+The solution here is simply to ensure that slaves do not have
routes of their own, and if for some reason they must, those routes do
not supersede routes of their master. This should generally be the
case, but unusual configurations or errant manual or automatic static
@@ -2295,7 +2295,7 @@ active-backup:
the switches have an ISL and play together well. If the
network configuration is such that one switch is specifically
a backup switch (e.g., has lower capacity, higher cost, etc),
- then the primary option can be used to insure that the
+ then the primary option can be used to ensure that the
preferred link is always used when it is available.
broadcast:
@@ -2322,7 +2322,7 @@ monitor can provide a higher level of reliability in detecting end to
end connectivity failures (which may be caused by the failure of any
individual component to pass traffic for any reason). Additionally,
the ARP monitor should be configured with multiple targets (at least
-one for each switch in the network). This will insure that,
+one for each switch in the network). This will ensure that,
regardless of which switch is active, the ARP monitor has a suitable
target to query.
@@ -2916,6 +2916,17 @@ from the bond (``ifenslave -d bond0 eth0``). The bonding driver will
then restore the MAC addresses that the slaves had before they were
enslaved.
+9. What bonding modes support native XDP?
+------------------------------------------
+
+ * balance-rr (0)
+ * active-backup (1)
+ * balance-xor (2)
+ * 802.3ad (4)
+
+Note that the vlan+srcmac hash policy does not support native XDP.
+For other bonding modes, the XDP program must be loaded with generic mode.
+
16. Resources and Links
=======================
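The native XDP entry added to bonding.rst above can be restated as a small lookup. This is only a sketch of the rule as the documentation words it: the function name `supports_native_xdp` and the table `NATIVE_XDP_MODES` are hypothetical, and applying the vlan+srcmac exclusion to every mode is an assumption, since the text does not say which modes consult the hash policy.

```python
from typing import Optional

# Bonding modes that the documentation lists as supporting native XDP
# (mode number -> mode name).
NATIVE_XDP_MODES = {
    0: "balance-rr",
    1: "active-backup",
    2: "balance-xor",
    4: "802.3ad",
}


def supports_native_xdp(mode: int, xmit_hash_policy: Optional[str] = None) -> bool:
    """Return True if a bond with these settings can use native XDP.

    Assumption: the vlan+srcmac exclusion is applied regardless of mode.
    """
    if xmit_hash_policy == "vlan+srcmac":
        return False
    return mode in NATIVE_XDP_MODES
```

When the function returns False, the documentation says the XDP program must be loaded in generic mode instead.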
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index 62519d38c58b..b018ce346392 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -699,10 +699,10 @@ RAW socket option CAN_RAW_JOIN_FILTERS
The CAN_RAW socket can set multiple CAN identifier specific filters that
lead to multiple filters in the af_can.c filter processing. These filters
-are indenpendent from each other which leads to logical OR'ed filters when
+are independent from each other which leads to logical OR'ed filters when
applied (see :ref:`socketcan-rawfilter`).
-This socket option joines the given CAN filters in the way that only CAN
+This socket option joins the given CAN filters in the way that only CAN
frames are passed to user space that matched *all* given CAN filters. The
semantic for the applied filters is therefore changed to a logical AND.
diff --git a/Documentation/networking/cdc_mbim.rst b/Documentation/networking/cdc_mbim.rst
index 37f968acc473..8404a3f794f3 100644
--- a/Documentation/networking/cdc_mbim.rst
+++ b/Documentation/networking/cdc_mbim.rst
@@ -51,7 +51,7 @@ Such userspace applications includes, but are not limited to:
- mbimcli (included with the libmbim [3] library), and
- ModemManager [4]
-Establishing a MBIM IP session reequires at least these actions by the
+Establishing a MBIM IP session requires at least these actions by the
management application:
- open the control channel
diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index a4c7d0c65fd7..4561e8ab9e08 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -230,6 +230,11 @@ per-queue stats) from the device.
In addition the driver logs the stats to syslog upon device reset.
+On supported instance types, the statistics will also include the
+ENA Express data (fields prefixed with `ena_srd`). For complete
+documentation of ENA Express data, refer to
+https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ena-express.html#ena-express-monitor
+
MTU
===
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
index 199647729251..32ee827a3a2c 100644
--- a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
@@ -339,7 +339,7 @@ Key functions include:
a bind of the root DPRC to the DPRC driver
The binding for the MC-bus device-tree node can be consulted at
-*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt*.
+*Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml*.
The sysfs bind/unbind interfaces for the MC-bus can be consulted at
*Documentation/ABI/testing/sysfs-bus-fsl-mc*.
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index 6932d8c043c2..6fc1961492b7 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -44,6 +44,7 @@ Contents:
marvell/octeon_ep
marvell/octeon_ep_vf
mellanox/mlx5/index
+ meta/fbnic
microsoft/netvsc
neterion/s2io
netronome/nfp
diff --git a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
index 4fbaa1a2d674..53d9d5829d69 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
@@ -299,6 +299,18 @@ Use ethtool to view and set link-down-on-close, as follows::
ethtool --show-priv-flags ethX
ethtool --set-priv-flags ethX link-down-on-close [on|off]
+Setting the mdd-auto-reset-vf Private Flag
+------------------------------------------
+
+When the mdd-auto-reset-vf private flag is set to "on", the problematic VF will
+be automatically reset if a malformed descriptor is detected. If the flag is
+set to "off", the problematic VF will be disabled.
+
+Use ethtool to view and set mdd-auto-reset-vf, as follows::
+
+ ethtool --show-priv-flags ethX
+ ethtool --set-priv-flags ethX mdd-auto-reset-vf [on|off]
+
Viewing Link Messages
---------------------
Link messages will not be displayed to the console if the distribution is
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
index 934752f675ba..3c46a48d99ba 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -101,6 +101,37 @@ example, if Rx packets are 10 and Netdev (software statistics) displays
rx_bytes as "X", then ethtool (hardware statistics) will display rx_bytes as
"X+40" (4 bytes CRC x 10 packets).
+ethtool reset
+-------------
+The driver supports 3 types of resets:
+
+- PF reset - resets only components associated with the given PF, does not
+ impact other PFs
+
+- CORE reset - whole adapter is affected, resets all PFs
+
+- GLOBAL reset - same as CORE but MAC and PHY components are also reinitialized
+
+These are mapped to ethtool reset flags as follows:
+
+- PF reset:
+
+ # ethtool --reset <ethX> irq dma filter offload
+
+- CORE reset:
+
+ # ethtool --reset <ethX> irq-shared dma-shared filter-shared offload-shared \
+ ram-shared
+
+- GLOBAL reset:
+
+ # ethtool --reset <ethX> irq-shared dma-shared filter-shared offload-shared \
+ mac-shared phy-shared ram-shared
+
+In switchdev mode you can reset a VF using port representor:
+
+ # ethtool --reset <repr> irq dma filter offload
+
Viewing Link Messages
---------------------
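The reset-to-flag mapping added to ice.rst above can be captured as a table that builds the ethtool argument list. This is a minimal sketch: the flag names are copied from the documentation, while `ICE_RESET_FLAGS` and `ethtool_reset_argv` are hypothetical names introduced for illustration.

```python
# ethtool reset flags per ice reset type, as listed in the documentation.
ICE_RESET_FLAGS = {
    "pf": ["irq", "dma", "filter", "offload"],
    "core": ["irq-shared", "dma-shared", "filter-shared", "offload-shared",
             "ram-shared"],
    "global": ["irq-shared", "dma-shared", "filter-shared", "offload-shared",
               "mac-shared", "phy-shared", "ram-shared"],
}


def ethtool_reset_argv(ifname: str, reset_type: str) -> list:
    """Build the argv for `ethtool --reset <ifname> <flags...>`.

    Raises KeyError for a reset type the documentation does not list.
    """
    return ["ethtool", "--reset", ifname, *ICE_RESET_FLAGS[reset_type]]
```

The resulting list can be handed to `subprocess.run(argv, check=True)` by a privileged process. In switchdev mode, passing a port representor name with the "pf" flag set matches the VF reset command shown in the documentation.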
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
index 1e196cb9ce25..af7db0e91f6b 100644
--- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -14,6 +14,7 @@ Contents
- `Basic packet flow`_
- `Devlink health reporters`_
- `Quality of service`_
+- `RVU representors`_
Overview
========
@@ -340,3 +341,93 @@ Setup HTB offload
# tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 2 quantum 188416
# tc class add dev <interface> parent 1: classid 1:3 htb rate 10Gbit prio 2 quantum 32768
+
+
+RVU Representors
+================
+
+The RVU representor driver adds support for creating representor devices for
+the VFs of RVU PFs in the system. Representor devices are created when the user
+enables switchdev mode.
+Switchdev mode can be enabled either before or after setting up SRIOV numVFs.
+All representor devices share a single NIXLF, but each has dedicated Rx/Tx
+queues. The RVU PF representor driver registers a separate netdev for each
+Rx/Tx queue pair.
+
+Current HW does not have a built-in switch that can do L2 learning and
+forward packets between representee and representor. Hence, the packet path
+between a representee and its representor is achieved by setting up appropriate
+NPC MCAM filters.
+Transmitted packets matching these filters are looped back through the hardware
+loopback channel/interface (i.e., instead of being sent out of the MAC
+interface), where they match the installed filters again and are forwarded.
+This way the representee => representor and representor => representee packet
+paths are achieved. These rules are installed when representors are created
+and are activated/deactivated based on the representor/representee interface state.
+
+Usage example:
+
+ - Change device to switchdev mode::
+
+ # devlink dev eswitch set pci/0002:1c:00.0 mode switchdev
+
+ - List of representor devices on the system::
+
+ # ip link show
+ Rpf1vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether f6:43:83:ee:26:21 brd ff:ff:ff:ff:ff:ff
+ Rpf1vf1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 12:b2:54:0e:24:54 brd ff:ff:ff:ff:ff:ff
+ Rpf1vf2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 4a:12:c4:4c:32:62 brd ff:ff:ff:ff:ff:ff
+ Rpf1vf3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether ca:cb:68:0e:e2:6e brd ff:ff:ff:ff:ff:ff
+ Rpf2vf0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000 link/ether 06:cc:ad:b4:f0:93 brd ff:ff:ff:ff:ff:ff
+
+
+To delete the representor devices from the system, change the device to legacy mode.
+
+ - Change device to legacy mode::
+
+ # devlink dev eswitch set pci/0002:1c:00.0 mode legacy
+
+RVU representors can be managed using devlink ports
+(see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
+
+ - Show devlink ports of representors::
+
+ # devlink port
+ pci/0002:1c:00.0/0: type eth netdev Rpf1vf0 flavour physical port 0 splittable false
+ pci/0002:1c:00.0/1: type eth netdev Rpf1vf1 flavour pcivf controller 0 pfnum 1 vfnum 1 external false splittable false
+ pci/0002:1c:00.0/2: type eth netdev Rpf1vf2 flavour pcivf controller 0 pfnum 1 vfnum 2 external false splittable false
+ pci/0002:1c:00.0/3: type eth netdev Rpf1vf3 flavour pcivf controller 0 pfnum 1 vfnum 3 external false splittable false
+
+Function attributes
+===================
+
+The RVU representor driver supports function attributes for representors.
+Port function configuration of the representors is supported through the devlink eswitch port.
+
+MAC address setup
+-----------------
+
+The RVU representor driver supports the devlink port function attr mechanism to set up
+the MAC address (refer to Documentation/networking/devlink/devlink-port.rst).
+
+ - To set up the MAC address for port 2::
+
+ # devlink port function set pci/0002:1c:00.0/2 hw_addr 5c:a1:1b:5e:43:11
+ # devlink port show pci/0002:1c:00.0/2
+ pci/0002:1c:00.0/2: type eth netdev Rpf1vf2 flavour pcivf controller 0 pfnum 1 vfnum 2 external false splittable false
+ function:
+ hw_addr 5c:a1:1b:5e:43:11
+
+
+TC offload
+==========
+
+The RVU representor driver implements support for offloading tc rules using port representors.
+
+ - Drop packets with vlan id 3::
+
+ # tc filter add dev Rpf1vf0 protocol 802.1Q parent ffff: flower vlan_id 3 vlan_ethtype ipv4 skip_sw action drop
+
+ - Redirect IPv4 packets with vlan id 5 to eth1, after stripping the vlan header::
+
+ # tc filter add dev Rpf1vf0 ingress protocol 802.1Q flower vlan_id 5 vlan_ethtype ipv4 skip_sw action vlan pop action mirred ingress redirect dev eth1
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
index f69ee1ebee01..99d95be4d159 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -189,22 +189,19 @@ the software port.
* - `rx[i]_gro_packets`
- Number of received packets processed using hardware-accelerated GRO. The
- number of hardware GRO offloaded packets received on ring i.
+ number of hardware GRO offloaded packets received on ring i. Only true GRO
+ packets are counted: only packets that are in an SKB with a GRO count > 1.
- Acceleration
* - `rx[i]_gro_bytes`
- Number of received bytes processed using hardware-accelerated GRO. The
- number of hardware GRO offloaded bytes received on ring i.
+ number of hardware GRO offloaded bytes received on ring i. Only true GRO
+ packets are counted: only packets that are in an SKB with a GRO count > 1.
- Acceleration
* - `rx[i]_gro_skbs`
- - The number of receive SKBs constructed while performing
- hardware-accelerated GRO.
- - Informative
-
- * - `rx[i]_gro_match_packets`
- - Number of received packets processed using hardware-accelerated GRO that
- met the flow table match criteria.
+ - The number of GRO SKBs constructed from hardware-accelerated GRO. Only SKBs
+ with a GRO count > 1 are counted.
- Informative
* - `rx[i]_gro_large_hds`
@@ -212,6 +209,31 @@ the software port.
headers that require additional memory to be allocated.
- Informative
+ * - `rx[i]_hds_nodata_packets`
+ - Number of header only packets in header/data split mode [#accel]_.
+ - Informative
+
+ * - `rx[i]_hds_nodata_bytes`
+ - Number of bytes for header only packets in header/data split mode
+ [#accel]_.
+ - Informative
+
+ * - `rx[i]_hds_nosplit_packets`
+ - Number of packets that were not split in header/data split mode. A
+ packet will not get split when the hardware does not support its
+ protocol splitting. An example of such a protocol is ICMPv4/v6. Currently
+ TCP and UDP with IPv4/IPv6 are supported for header/data split
+ [#accel]_.
+ - Informative
+
+ * - `rx[i]_hds_nosplit_bytes`
+ - Number of bytes for packets that were not split in header/data split
+ mode. A packet will not get split when the hardware does not support its
+ protocol splitting. An example of such a protocol is ICMPv4/v6. Currently
+ TCP and UDP with IPv4/IPv6 are supported for header/data split
+ [#accel]_.
+ - Informative
+
* - `rx[i]_lro_packets`
- The number of LRO packets received on ring i [#accel]_.
- Acceleration
@@ -300,6 +322,11 @@ the software port.
in the beginning of the queue. This is a normal condition.
- Informative
+ * - `tx[i]_timestamps`
+ - Transmitted packets that were hardware timestamped at the device's DMA
+ layer.
+ - Informative
+
* - `tx[i]_added_vlan_packets`
- The number of packets sent where vlan tag insertion was offloaded to the
hardware.
@@ -702,6 +729,12 @@ the software port.
the device typically ensures not posting the CQE.
- Error
+ * - `ptp_cq[i]_lost_cqe`
+ - Number of times a CQE is expected not to be delivered on the PTP
+ timestamping CQ by the device due to a time delta elapsing. If such a
+ CQE is somehow delivered, `ptp_cq[i]_late_cqe` is incremented.
+ - Error
+
.. [#ring_global] The corresponding ring and global counters do not share the
same name (i.e. do not follow the common naming scheme).
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
index 20d3b7e87049..34e911480108 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/kconfig.rst
@@ -130,6 +130,9 @@ Enabling the driver and kconfig options
| Build support for software-managed steering in the NIC.
+**CONFIG_MLX5_HW_STEERING=(y/n)**
+
+| Build support for hardware-managed steering in the NIC.
**CONFIG_MLX5_TC_CT=(y/n)**
diff --git a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
new file mode 100644
index 000000000000..04e0595bb0a7
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=====================================
+Meta Platforms Host Network Interface
+=====================================
+
+Firmware Versions
+-----------------
+
+fbnic has three components stored on the flash which are provided in one PLDM
+image:
+
+1. fw - The control firmware used to view and modify firmware settings, request
+ firmware actions, and retrieve firmware counters outside of the data path.
+ This is the firmware which fbnic_fw.c interacts with.
+2. bootloader - The firmware which validates firmware security and controls basic
+ operations including loading and updating the firmware. This is also known
+ as the cmrt firmware.
+3. undi - This is the UEFI driver which is based on the Linux driver.
+
+fbnic stores two copies of these three components on flash. This allows fbnic
+to fall back to an older version of firmware automatically in case firmware
+fails to boot. Version information for both is provided as running and stored.
+The undi is only provided in stored as it is not actively running once the Linux
+driver takes over.
+
+devlink dev info provides version information for all three components. In
+addition to the version, the hg commit hash of the build is included as a
+separate entry.
+
+Statistics
+----------
+
+RPC (Rx parser)
+~~~~~~~~~~~~~~~
+
+ - ``rpc_unkn_etype``: frames containing unknown EtherType
+ - ``rpc_unkn_ext_hdr``: frames containing unknown IPv6 extension header
+ - ``rpc_ipv4_frag``: frames containing IPv4 fragment
+ - ``rpc_ipv6_frag``: frames containing IPv6 fragment
+ - ``rpc_ipv4_esp``: frames with IPv4 ESP encapsulation
+ - ``rpc_ipv6_esp``: frames with IPv6 ESP encapsulation
+ - ``rpc_tcp_opt_err``: frames which encountered TCP option parsing error
+ - ``rpc_out_of_hdr_err``: frames where header was larger than parsable region
+ - ``ovr_size_err``: oversized frames
+
+PCIe
+~~~~
+
+The fbnic driver exposes PCIe hardware performance statistics through debugfs
+(``pcie_stats``). These statistics provide insights into PCIe transaction
+behavior and potential performance bottlenecks.
+
+1. PCIe Transaction Counters:
+
+ These counters track PCIe transaction activity:
+ - ``pcie_ob_rd_tlp``: Outbound read Transaction Layer Packets count
+ - ``pcie_ob_rd_dword``: DWORDs transferred in outbound read transactions
+ - ``pcie_ob_wr_tlp``: Outbound write Transaction Layer Packets count
+ - ``pcie_ob_wr_dword``: DWORDs transferred in outbound write
+ transactions
+ - ``pcie_ob_cpl_tlp``: Outbound completion TLP count
+ - ``pcie_ob_cpl_dword``: DWORDs transferred in outbound completion TLPs
+
+2. PCIe Resource Monitoring:
+
+ These counters indicate PCIe resource exhaustion events:
+ - ``pcie_ob_rd_no_tag``: Read requests dropped due to tag unavailability
+ - ``pcie_ob_rd_no_cpl_cred``: Read requests dropped due to completion
+ credit exhaustion
+ - ``pcie_ob_rd_no_np_cred``: Read requests dropped due to non-posted
+ credit exhaustion
diff --git a/Documentation/networking/device_drivers/wwan/t7xx.rst b/Documentation/networking/device_drivers/wwan/t7xx.rst
index f346f5f85f15..e07de7700dfc 100644
--- a/Documentation/networking/device_drivers/wwan/t7xx.rst
+++ b/Documentation/networking/device_drivers/wwan/t7xx.rst
@@ -7,12 +7,13 @@
============================================
t7xx driver for MTK PCIe based T700 5G modem
============================================
-The t7xx driver is a WWAN PCIe host driver developed for linux or Chrome OS platforms
-for data exchange over PCIe interface between Host platform & MediaTek's T700 5G modem.
-The driver exposes an interface conforming to the MBIM protocol [1]. Any front end
-application (e.g. Modem Manager) could easily manage the MBIM interface to enable
-data communication towards WWAN. The driver also provides an interface to interact
-with the MediaTek's modem via AT commands.
+The t7xx driver is a WWAN PCIe host driver developed for Linux or Chrome OS
+platforms for data exchange over PCIe interface between Host platform &
+MediaTek's T700 5G modem.
+The driver exposes an interface conforming to the MBIM protocol [1]. Any front
+end application (e.g. Modem Manager) could easily manage the MBIM interface to
+enable data communication towards WWAN. The driver also provides an interface
+to interact with the MediaTek's modem via AT commands.
Basic usage
===========
@@ -45,8 +46,8 @@ The driver provides sysfs interfaces to userspace.
t7xx_mode
---------
-The sysfs interface provides userspace with access to the device mode, this interface
-supports read and write operations.
+The sysfs interface provides userspace with access to the device mode; this
+interface supports read and write operations.
Device mode:
@@ -67,6 +68,28 @@ Write from userspace to set the device mode.
::
$ echo fastboot_switching > /sys/bus/pci/devices/${bdf}/t7xx_mode
+t7xx_debug_ports
+----------------
+The sysfs interface lets userspace enable or disable the debug ports. This
+interface supports read and write operations.
+
+Debug port status:
+
+- ``1``: debug ports are enabled
+- ``0``: debug ports are disabled
+
+Currently supported debug ports: ADB and MIPC.
+
+Read from userspace to get the current debug ports status.
+
+::
+ $ cat /sys/bus/pci/devices/${bdf}/t7xx_debug_ports
+
+Write from userspace to set the debug ports status.
+
+::
+ $ echo 1 > /sys/bus/pci/devices/${bdf}/t7xx_debug_ports
+
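The read and write steps for ``t7xx_debug_ports`` can be combined in a small
shell helper. This is only a sketch: the function name
``t7xx_debug_ports_enable`` is hypothetical, the attribute path is supplied by
the caller, and it assumes that a status starting with ``1`` means the ports
are already enabled. Writing the attribute typically requires root privileges.

```shell
# Hypothetical helper, not part of the t7xx driver or its documentation.
# $1: path to the attribute, for example
#     /sys/bus/pci/devices/${bdf}/t7xx_debug_ports
t7xx_debug_ports_enable() {
    attr=$1
    # Read the current status; fail if the attribute cannot be read.
    status=$(cat "$attr") || return 1
    case $status in
        1*) echo "already enabled" ;;
        # Assumption: any other status means the ports are disabled.
        *)  echo 1 > "$attr" && echo "enabled" ;;
    esac
}
```

For example: ``t7xx_debug_ports_enable /sys/bus/pci/devices/${bdf}/t7xx_debug_ports``.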
Management application development
==================================
The driver and userspace interfaces are described below. The MBIM protocol is
@@ -139,6 +162,25 @@ Please note that driver needs to be reloaded to export /dev/wwan0fastboot0
port, because device needs a cold reset after enter ``fastboot_switching``
mode.
+ADB port userspace ABI
+----------------------
+
+/dev/wwan0adb0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes an ADB protocol interface by implementing an ADB WWAN port.
+The userspace end of the ADB channel pipe is a /dev/wwan0adb0 character device.
+Applications shall use this interface for ADB protocol communication.
+
+MIPC port userspace ABI
+-----------------------
+
+/dev/wwan0mipc0 character device
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The driver exposes a diagnostic interface by implementing a MIPC (Modem
+Information Process Center) WWAN port. The userspace end of the MIPC channel
+pipe is a /dev/wwan0mipc0 character device.
+Applications shall use this interface for MTK modem diagnostic communication.
+
The MediaTek's T700 modem supports the 3GPP TS 27.007 [4] specification.
References
@@ -164,3 +206,9 @@ speak the Mobile Interface Broadband Model (MBIM) protocol"*
[5] *fastboot "a mechanism for communicating with bootloaders"*
- https://android.googlesource.com/platform/system/core/+/refs/heads/main/fastboot/README.md
+
+[6] *ADB (Android Debug Bridge) "a mechanism to keep track of Android devices
+and emulators instances connected to or running on a given host developer
+machine with ADB protocol"*
+
+- https://android.googlesource.com/platform/packages/modules/adb/+/refs/heads/main/README.md
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
index 1242b0e6826b..23073bc219d8 100644
--- a/Documentation/networking/devlink/devlink-info.rst
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -146,6 +146,11 @@ board.manufacture
An identifier of the company or the facility which produced the part.
+board.part_number
+-----------------
+
+Part number of the board and its components.
+
fw
--
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 562f46b41274..9d22d41a7cd1 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -134,6 +134,9 @@ Users may also set the IPsec crypto capability of the function using
Users may also set the IPsec packet capability of the function using
`devlink port function set ipsec_packet` command.
+Users may also set the maximum IO event queues of the function
+using `devlink port function set max_io_eqs` command.
+
Function attributes
===================
@@ -295,6 +298,36 @@ policy is processed in software by the kernel.
function:
hw_addr 00:00:00:00:00:00 ipsec_packet enabled
+Maximum IO event queues setup
+-----------------------------
+When the user sets the maximum number of IO event queues for an SF or
+a VF, the function driver is limited to consuming only the enforced
+number of IO event queues.
+
+IO event queues deliver events related to IO queues, including network
+device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
+For example, the number of netdevice channels and RDMA device completion
+vectors are derived from the function's IO event queues. Usually, the number
+of interrupt vectors consumed by the driver is limited by the number of IO
+event queues per device, as each of the IO event queues is connected to an
+interrupt vector.
+
+- Get maximum IO event queues of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10
+
+- Set maximum IO event queues of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32
+
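To check the result of ``devlink port function set`` from a script, the
``max_io_eqs`` value can be extracted from the ``devlink port show`` output.
A minimal sketch, assuming the attribute is printed as ``max_io_eqs <value>``
as in the sample output above; the helper name ``get_max_io_eqs`` is
hypothetical.

```shell
# Hypothetical helper: read `devlink port show` output on stdin and print
# the value that follows the "max_io_eqs" keyword. Returns non-zero if the
# keyword is not present.
get_max_io_eqs() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "max_io_eqs") { print $(i + 1); found = 1; exit }
    }
    END { exit !found }'
}
```

For example: ``devlink port show pci/0000:06:00.0/2 | get_max_io_eqs``.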
Subfunction
============
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
index 9232cd7da301..5d0b68f752c0 100644
--- a/Documentation/networking/devlink/devlink-region.rst
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -49,7 +49,7 @@ example usage
$ devlink region show [ DEV/REGION ]
$ devlink region del DEV/REGION snapshot SNAPSHOT_ID
$ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
- $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length length
+ $ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ] address ADDRESS length LENGTH
# Show all of the exposed regions with region sizes:
$ devlink region show
diff --git a/Documentation/networking/devlink/hns3.rst b/Documentation/networking/devlink/hns3.rst
index 4562a6e4782f..72bc1b9f3785 100644
--- a/Documentation/networking/devlink/hns3.rst
+++ b/Documentation/networking/devlink/hns3.rst
@@ -23,3 +23,8 @@ The ``hns3`` driver reports the following versions
* - ``fw``
- running
- Used to represent the firmware version.
+ * - ``fw.scc``
+ - running
+ - Used to represent the Soft Congestion Control (SCC) firmware version.
+ SCC is a firmware component which provides multiple RDMA congestion
+ control algorithms, including DCQCN.
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
index 7f30ebd5debb..e3972d03cea0 100644
--- a/Documentation/networking/devlink/ice.rst
+++ b/Documentation/networking/devlink/ice.rst
@@ -11,6 +11,7 @@ Parameters
==========
.. list-table:: Generic parameters implemented
+ :widths: 5 5 90
* - Name
- Mode
@@ -21,6 +22,77 @@ Parameters
* - ``enable_iwarp``
- runtime
- mutually exclusive with ``enable_roce``
+ * - ``tx_scheduling_layers``
+ - permanent
+ - The ice hardware uses hierarchical scheduling for Tx with a fixed
+ number of layers in the scheduling tree. Each layer is a decision
+ point. The root node represents a port, while all the leaves represent
+ the queues. This way of configuring the Tx scheduler allows features
+ like DCB or devlink-rate (documented below) to configure how much
+ bandwidth is given to any given queue or group of queues, enabling
+ fine-grained control because scheduling parameters can be configured
+ at any given layer of the tree.
+
+ The default 9-layer tree topology was deemed best for most workloads,
+ as it gives an optimal ratio of performance to configurability. However,
+ for some specific cases, this 9-layer topology might not be desired.
+ One example would be sending traffic to a number of queues that is not a
+ multiple of 8. Because the maximum radix is limited to 8 in 9-layer topology,
+ the 9th queue has a different parent than the rest, and it's given
+ more bandwidth credits. This causes a problem when the system is
+ sending traffic to 9 queues:
+
+ | tx_queue_0_packets: 24163396
+ | tx_queue_1_packets: 24164623
+ | tx_queue_2_packets: 24163188
+ | tx_queue_3_packets: 24163701
+ | tx_queue_4_packets: 24163683
+ | tx_queue_5_packets: 24164668
+ | tx_queue_6_packets: 23327200
+ | tx_queue_7_packets: 24163853
+ | tx_queue_8_packets: 91101417 < Too much traffic is sent from 9th
+
+ To address this need, you can switch to a 5-layer topology, which
+ changes the maximum topology radix to 512. With this enhancement,
+ the performance characteristic is equal as all queues can be assigned
+ to the same parent in the tree. The obvious drawback of this solution
+ is a lower configuration depth of the tree.
+
+ Use the ``tx_scheduling_layers`` parameter with the devlink command
+ to change the transmit scheduler topology. To use 5-layer topology,
+ use a value of 5. For example:
+ $ devlink dev param set pci/0000:16:00.0 name tx_scheduling_layers
+ value 5 cmode permanent
+ Use a value of 9 to set it back to the default value.
+
+ You must power cycle the PCI slot for the selected topology to take effect.
+
+ To verify that the value has been set:
+ $ devlink dev param show pci/0000:16:00.0 name tx_scheduling_layers
+.. list-table:: Driver specific parameters implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Mode
+ - Description
+ * - ``local_forwarding``
+ - runtime
+ - Controls loopback behavior by tuning scheduler bandwidth.
+ It impacts all kinds of functions: physical, virtual and
+ subfunctions.
+ Supported values are:
+
+ ``enabled`` - loopback traffic is allowed on port
+
+ ``disabled`` - loopback traffic is not allowed on this port
+
+ ``prioritized`` - loopback traffic is prioritized on this port
+
+ Default value of ``local_forwarding`` parameter is ``enabled``.
+ ``prioritized`` provides the ability to adjust the loopback traffic rate to
+ increase the capacity of one port at the cost of another. The user needs to
+ disable local forwarding on one of the ports in order to have increased
+ capacity on the ``prioritized`` port.
Info versions
=============
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 456985407475..41618538fc70 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -53,6 +53,9 @@ parameters.
* ``smfs`` Software managed flow steering. In SMFS mode, the HW
steering entities are created and manage through the driver without
firmware intervention.
+ * ``hmfs`` Hardware managed flow steering. In HMFS mode, the driver
+ configures steering rules directly in the HW using Work Queues with
+ a special new type of WQE (Work Queue Element).
SMFS mode is faster and provides better rule insertion rate compared to
default DMFS mode.
diff --git a/Documentation/networking/devlink/nfp.rst b/Documentation/networking/devlink/nfp.rst
index a1717db0dfcc..3093642bdae4 100644
--- a/Documentation/networking/devlink/nfp.rst
+++ b/Documentation/networking/devlink/nfp.rst
@@ -32,7 +32,7 @@ The ``nfp`` driver reports the following versions
- Description
* - ``board.id``
- fixed
- - Part number identifying the board design
+ - Identifier of the board design
* - ``board.rev``
- fixed
- Revision of the board design
@@ -42,6 +42,9 @@ The ``nfp`` driver reports the following versions
* - ``board.model``
- fixed
- Model name of the board design
+ * - ``board.part_number``
+ - fixed
+ - Part number of the board and its components
* - ``fw.bundle_id``
- stored, running
- Firmware bundle id
diff --git a/Documentation/networking/devlink/octeontx2.rst b/Documentation/networking/devlink/octeontx2.rst
index 610de99b728a..84206537aedb 100644
--- a/Documentation/networking/devlink/octeontx2.rst
+++ b/Documentation/networking/devlink/octeontx2.rst
@@ -40,3 +40,40 @@ The ``octeontx2 AF`` driver implements the following driver-specific parameters.
- runtime
- Use to set the quantum which hardware uses for scheduling among transmit queues.
Hardware uses weighted DWRR algorithm to schedule among all transmit queues.
+ * - ``npc_mcam_high_zone_percent``
+ - u8
+ - runtime
+ - Use to set the number of high priority zone entries in NPC MCAM that can be allocated
+ by a user, out of the three priority zone categories high, mid and low.
+ * - ``npc_def_rule_cntr``
+ - bool
+ - runtime
+ - Use to enable or disable hit counters for the default rules in NPC MCAM.
+ It's not guaranteed that counters get enabled and mapped to all the default rules,
+ since the counters are scarce and the driver follows a best-effort approach.
+ The default rule serves as the primary packet steering rule for a specific PF or VF,
+ based on its DMAC address which is installed by AF driver as part of its initialization.
+ Sample command to read hit counters for default rule from debugfs is as follows,
+ cat /sys/kernel/debug/cn10k/npc/mcam_rules
+ * - ``nix_maxlf``
+ - u16
+ - runtime
+ - Use to set the maximum number of LFs in NIX hardware block. This would be useful
+ to increase the availability of default resources allocated to enabled LFs like
+ MCAM entries for example.
+
+The ``octeontx2 PF`` driver implements the following driver-specific parameters.
+
+.. list-table:: Driver-specific parameters implemented
+ :widths: 5 5 5 85
+
+ * - Name
+ - Type
+ - Mode
+ - Description
+ * - ``unicast_filter_count``
+ - u8
+ - runtime
+ - Set the maximum number of unicast filters that can be programmed for
+ the device. This can be used to achieve better device resource
+ utilization, avoiding overconsumption of unused MCAM table entries.
diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
new file mode 100644
index 000000000000..d95363645331
--- /dev/null
+++ b/Documentation/networking/devmem.rst
@@ -0,0 +1,278 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Device Memory TCP
+=================
+
+
+Intro
+=====
+
+Device memory TCP (devmem TCP) enables receiving data directly into device
+memory (dmabuf). The feature is currently implemented for TCP sockets.
+
+
+Opportunity
+-----------
+
+A large number of data transfers have device memory as the source and/or
+destination. Accelerators drastically increased the prevalence of such
+transfers. Some examples include:
+
+- Distributed training, where ML accelerators, such as GPUs on different hosts,
+ exchange data.
+
+- Distributed raw block storage applications transfer large amounts of data with
+ remote SSDs. Much of this data does not require host processing.
+
+Typically the Device-to-Device data transfers in the network are implemented as
+the following low-level operations: Device-to-Host copy, Host-to-Host network
+transfer, and Host-to-Device copy.
+
+The flow involving host copies is suboptimal, especially for bulk data transfers,
+and can put significant strain on system resources such as host memory
+bandwidth and PCIe bandwidth.
+
+Devmem TCP optimizes this use case by implementing socket APIs that enable
+the user to receive incoming network packets directly into device memory.
+
+Packet payloads go directly from the NIC to device memory.
+
+Packet headers go to host memory and are processed by the TCP/IP stack
+normally. The NIC must support header split to achieve this.
+
+Advantages:
+
+- Alleviate host memory bandwidth pressure, compared to existing
+ network-transfer + device-copy semantics.
+
+- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
+ level of the PCIe tree, compared to the traditional path which sends data
+ through the root complex.
+
+
+More Info
+---------
+
+ slides, video
+ https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html
+
+ patchset
+ [PATCH net-next v24 00/13] Device Memory TCP
+ https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
+
+
+Interface
+=========
+
+
+Example
+-------
+
+tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
+the RX path of this API.
+
+
+NIC Setup
+---------
+
+Header split, flow steering, & RSS are required features for devmem TCP.
+
+Header split is used to split incoming packets into a header buffer in host
+memory, and a payload buffer in device memory.
+
+Flow steering & RSS are used to ensure that only flows targeting devmem land on
+an RX queue bound to devmem.
+
+Enable header split & flow steering::
+
+ # enable header split
+ ethtool -G eth1 tcp-data-split on
+
+
+ # enable flow steering
+ ethtool -K eth1 ntuple on
+
+Configure RSS to steer all traffic away from the target RX queue (queue 15 in
+this example)::
+
+ ethtool --set-rxfh-indir eth1 equal 15
+
+
+The user must bind a dmabuf to any number of RX queues on a given NIC using
+the netlink API::
+
+ /* Bind dmabuf to NIC RX queue 15 */
+ struct netdev_queue *queues;
+ queues = malloc(sizeof(*queues) * 1);
+
+ queues[0]._present.type = 1;
+ queues[0]._present.idx = 1;
+ queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
+ queues[0].idx = 15;
+
+ *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
+
+ req = netdev_bind_rx_req_alloc();
+ netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
+ netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
+ __netdev_bind_rx_req_set_queues(req, queues, n_queue_index);
+
+ rsp = netdev_bind_rx(*ys, req);
+
+ dmabuf_id = rsp->dmabuf_id;
+
+
+The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
+that has been bound.
+
+The user can unbind the dmabuf from the netdevice by closing the netlink socket
+that established the binding. We do this so that the binding is automatically
+unbound even if the userspace process crashes.
+
+Note that any reasonably well-behaved dmabuf from any exporter should work with
+devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
+this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
+
+
+Socket Setup
+------------
+
+The socket must be flow steered to the dmabuf bound RX queue::
+
+ ethtool -N eth1 flow-type tcp4 ... queue 15
+
+
+Receiving data
+--------------
+
+The user application must signal to the kernel that it is capable of receiving
+devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::
+
+ ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);
+
+Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
+on devmem data.
+
+Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
+Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::
+
+    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+        if (cm->cmsg_level != SOL_SOCKET ||
+            (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
+             cm->cmsg_type != SCM_DEVMEM_LINEAR))
+            continue;
+
+        dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);
+
+        if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
+            /* Frag landed in dmabuf.
+             *
+             * dmabuf_cmsg->dmabuf_id is the dmabuf the
+             * frag landed on.
+             *
+             * dmabuf_cmsg->frag_offset is the offset into
+             * the dmabuf where the frag starts.
+             *
+             * dmabuf_cmsg->frag_size is the size of the
+             * frag.
+             *
+             * dmabuf_cmsg->frag_token is a token used to
+             * refer to this frag for later freeing.
+             */
+
+            struct dmabuf_token token;
+            token.token_start = dmabuf_cmsg->frag_token;
+            token.token_count = 1;
+            continue;
+        }
+
+        if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
+            /* Frag landed in linear buffer.
+             *
+             * dmabuf_cmsg->frag_size is the size of the
+             * frag.
+             */
+            continue;
+
+    }
+
+Applications may receive 2 cmsgs:
+
+- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
+ by dmabuf_id.
+
+- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
+ This typically happens when the NIC is unable to split the packet at the
+ header boundary, such that part (or all) of the payload landed in host
+ memory.
+
+Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
+regular TCP data that landed on an RX queue not bound to a dmabuf.
+
+
+Freeing frags
+-------------
+
+Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
+processes the frag. The user must return the frag to the kernel via
+SO_DEVMEM_DONTNEED::
+
+ ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
+ sizeof(token));
+
+The user must ensure the tokens are returned to the kernel in a timely manner.
+Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
+and will lead to packet drops.
+
+The user must pass no more than 128 tokens, with no more than 1024 total frags
+among the token->token_count across all the tokens. If the user provides more
+than 1024 frags, the kernel will free up to 1024 frags and return early.
+
+The kernel returns the number of actual frags freed. The number of frags freed
+can be less than the tokens provided by the user in case of:
+
+(a) an internal kernel leak bug.
+(b) the user passed more than 1024 frags.
+
+Implementation & Caveats
+========================
+
+Unreadable skbs
+---------------
+
+Devmem payloads are inaccessible to the kernel processing the packets. This
+results in a few quirks for payloads of devmem skbs:
+
+- Loopback is not functional. Loopback relies on copying the payload, which is
+ not possible with devmem skbs.
+
+- Software checksum calculation fails.
+
+- tcpdump and BPF can't access devmem packet payloads.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+``tools/testing/selftests/net/ncdevmem.c``
+
+ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
+receives data directly into a udmabuf.
+
+To run ncdevmem, you need to run it on a server on the machine under test, and
+you need to run netcat on a peer to provide the TX data.
+
+ncdevmem has a validation mode as well that expects a repeating pattern of
+incoming data and validates it as such. For example, you can launch
+ncdevmem on the server by::
+
+ ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
+ -p 5201 -v 7
+
+On client side, use regular netcat to send TX data to ncdevmem process
+on the server::
+
+ yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
+ tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
diff --git a/Documentation/networking/diagnostic/index.rst b/Documentation/networking/diagnostic/index.rst
new file mode 100644
index 000000000000..86488aa46b48
--- /dev/null
+++ b/Documentation/networking/diagnostic/index.rst
@@ -0,0 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Networking Diagnostics
+======================
+
+.. toctree::
+ :maxdepth: 2
+
+ twisted_pair_layer1_diagnostics.rst
+
+.. only:: subproject and html
+
+ Indices
+ =======
+
+ * :ref:`genindex`
diff --git a/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst b/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst
new file mode 100644
index 000000000000..079e17effadf
--- /dev/null
+++ b/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst
@@ -0,0 +1,784 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Diagnostic Concept for Investigating Twisted Pair Ethernet Variants at OSI Layer 1
+==================================================================================
+
+Introduction
+------------
+
+This documentation is designed for two primary audiences:
+
+1. **Users and System Administrators**: For those dealing with real-world
+ Ethernet issues, this guide provides a practical, step-by-step
+ troubleshooting flow to help identify and resolve common problems in Twisted
+ Pair Ethernet at OSI Layer 1. If you're facing unstable links, speed drops,
+ or mysterious network issues, jump right into the step-by-step guide and
+ follow it through to find your solution.
+
+2. **Kernel Developers**: For developers working with network drivers and PHY
+ support, this documentation outlines the diagnostic process and highlights
+ areas where the Linux kernel’s diagnostic interfaces could be extended or
+ improved. By understanding the diagnostic flow, developers can better
+ prioritize future enhancements.
+
+Step-by-Step Diagnostic Guide from Linux (General Ethernet)
+-----------------------------------------------------------
+
+This diagnostic guide covers common Ethernet troubleshooting scenarios,
+focusing on **link stability and detection** across different Ethernet
+environments, including **Single-Pair Ethernet (SPE)** and **Multi-Pair
+Ethernet (MPE)**, as well as power delivery technologies like **PoDL** (Power
+over Data Line) and **PoE** (Clause 33 PSE).
+
+The guide is designed to help users diagnose physical layer (Layer 1) issues on
+systems running **Linux kernel version 6.11 or newer**, utilizing **ethtool
+version 6.10 or later** and **iproute2 version 6.4.0 or later**.
+
+In this guide, we assume that users may have **limited or no access to the link
+partner** and will focus on diagnosing issues locally.
+
+Diagnostic Scenarios
+~~~~~~~~~~~~~~~~~~~~
+
+- **Link is up and stable, but no data transfer**: If the link is stable but
+ there are issues with data transmission, refer to the **OSI Layer 2
+ Troubleshooting Guide**.
+
+- **Link is unstable**: Link resets, speed drops, or other fluctuations
+ indicate potential issues at the hardware or physical layer.
+
+- **No link detected**: The interface is up, but no link is established.
+
+Verify Interface Status
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Begin by verifying the status of the Ethernet interface to check if it is
+administratively up. `ethtool` provides information on the link and PHY
+status, but it does not show the **administrative state** of the interface.
+To check this, use the `ip` command, which reports the interface state
+within the angle brackets `"<>"` in its output.
+
+For example, in the output `<NO-CARRIER,BROADCAST,MULTICAST,UP>`, the important
+keywords are:
+
+- **UP**: The interface is in the administrative "UP" state.
+- **NO-CARRIER**: The interface is administratively up, but no physical link is
+ detected.
+
+If the output shows `<BROADCAST,MULTICAST>`, this indicates the interface is in
+the administrative "DOWN" state.
+
+- **Command:** `ip link show dev <interface>`
+
+- **Expected Output:**
+
+ .. code-block:: bash
+
+ 4: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 ...
+ link/ether 88:14:2b:00:96:f2 brd ff:ff:ff:ff:ff:ff
+
+- **Interpreting the Output:**
+
+ - **Administrative UP State**:
+
+ - If the output contains **"UP"**, the interface is administratively up,
+ and the system is trying to establish a physical link.
+
+ - If you also see **"NO-CARRIER"**, it means the physical link has not been
+ detected, indicating potential Layer 1 issues like a cable fault,
+ misconfiguration, or no connection at the link partner. In this case,
+ proceed to the **Inspect Link Status and PHY Configuration** section.
+
+ - **Administrative DOWN State**:
+
+ - If the output lacks **"UP"** and shows only states like
+ **"<BROADCAST,MULTICAST>"**, it means the interface is administratively
+ down. In this case, bring the interface up using the following command:
+
+ .. code-block:: bash
+
+ ip link set dev <interface> up
+
+- **Next Steps**:
+
+ - If the interface is **administratively up** but shows **NO-CARRIER**,
+ proceed to the **Inspect Link Status and PHY Configuration** section to
+ troubleshoot potential physical layer issues.
+
+ - If the interface was **administratively down** and you have brought it up,
+ be sure to **repeat this verification step** to confirm the new state of the
+ interface before proceeding.
+
+ - **If the interface is up and the link is detected**:
+
+ - If the output shows **"UP"** and there is **no `NO-CARRIER`**, the
+ interface is administratively up, and the physical link has been
+ successfully established. If everything is working as expected, the Layer
+ 1 diagnostics are complete, and no further action is needed.
+
+ - If the interface is up and the link is detected but **no data is being
+ transferred**, the issue is likely beyond Layer 1, and you should proceed
+ with diagnosing the higher layers of the OSI model. This may involve
+ checking Layer 2 configurations (such as VLANs or MAC address issues),
+ Layer 3 settings (like IP addresses, routing, or ARP), or Layer 4 and
+ above (firewalls, services, etc.).
+
+ - If the **link is unstable** or **frequently resetting or dropping**, this
+ may indicate a physical layer issue such as a faulty cable, interference,
+ or power delivery problems. In this case, proceed with the next step in
+ this guide.
+
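The flag checks in this step can be sketched as a small shell helper. It is
illustrative only: the function name ``classify_link_flags`` and the three
result strings are hypothetical, and the input is assumed to be the
comma-separated flag list that `ip link show` prints between the angle
brackets.

```shell
# Hypothetical helper: classify the flag list from `ip link show`,
# e.g. "NO-CARRIER,BROADCAST,MULTICAST,UP".
classify_link_flags() {
    # Wrap the list in commas so that only whole flags match
    # (LOWER_UP must not be mistaken for UP).
    case ",$1," in
        *,UP,*) ;;
        *) echo "admin-down"; return ;;
    esac
    case ",$1," in
        *,NO-CARRIER,*) echo "no-carrier" ;;
        *) echo "link-detected" ;;
    esac
}
```

One way to obtain the flag list is
``ip -o link show dev <interface> | sed -n 's/.*<\(.*\)>.*/\1/p'``.
An ``admin-down`` result corresponds to bringing the interface up and repeating
the check, and ``no-carrier`` corresponds to continuing with the link status
and PHY inspection.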
+Inspect Link Status and PHY Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use `ethtool -I` to check the link status, PHY configuration, supported link
+modes, and additional statistics such as the **Link Down Events** counter. This
+step is essential for diagnosing Layer 1 problems such as speed mismatches,
+duplex issues, and link instability.
+
+For both **Single-Pair Ethernet (SPE)** and **Multi-Pair Ethernet (MPE)**
+devices, you will use this step to gather key details about the link. **SPE**
+links generally support a single speed and mode without autonegotiation (with
+the exception of **10BaseT1L**), while **MPE** devices typically support
+multiple link modes and autonegotiation.
+
+- **Command:** `ethtool -I <interface>`
+
+- **Example Output for SPE Interface (Non-autonegotiation)**:
+
+ .. code-block:: bash
+
+ Settings for spe4:
+ Supported ports: [ TP ]
+ Supported link modes: 100baseT1/Full
+ Supported pause frame use: No
+ Supports auto-negotiation: No
+ Supported FEC modes: Not reported
+ Advertised link modes: Not applicable
+ Advertised pause frame use: No
+ Advertised auto-negotiation: No
+ Advertised FEC modes: Not reported
+ Speed: 100Mb/s
+ Duplex: Full
+ Auto-negotiation: off
+ master-slave cfg: forced slave
+ master-slave status: slave
+ Port: Twisted Pair
+ PHYAD: 6
+ Transceiver: external
+ MDI-X: Unknown
+ Supports Wake-on: d
+ Wake-on: d
+ Link detected: yes
+ SQI: 7/7
+ Link Down Events: 2
+
+- **Example Output for MPE Interface (Autonegotiation)**:
+
+ .. code-block:: bash
+
+ Settings for eth1:
+ Supported ports: [ TP MII ]
+ Supported link modes: 10baseT/Half 10baseT/Full
+ 100baseT/Half 100baseT/Full
+ Supported pause frame use: Symmetric Receive-only
+ Supports auto-negotiation: Yes
+ Supported FEC modes: Not reported
+ Advertised link modes: 10baseT/Half 10baseT/Full
+ 100baseT/Half 100baseT/Full
+ Advertised pause frame use: Symmetric Receive-only
+ Advertised auto-negotiation: Yes
+ Advertised FEC modes: Not reported
+ Link partner advertised link modes: 10baseT/Half 10baseT/Full
+ 100baseT/Half 100baseT/Full
+ Link partner advertised pause frame use: Symmetric Receive-only
+ Link partner advertised auto-negotiation: Yes
+ Link partner advertised FEC modes: Not reported
+ Speed: 100Mb/s
+ Duplex: Full
+ Auto-negotiation: on
+ Port: Twisted Pair
+ PHYAD: 10
+ Transceiver: internal
+ MDI-X: Unknown
+ Supports Wake-on: pg
+ Wake-on: p
+ Link detected: yes
+ Link Down Events: 1
+
+- **Next Steps**:
+
+ - Record the output provided by `ethtool`, particularly noting the
+ **master-slave status**, **speed**, **duplex**, and other relevant fields.
+ This information will be useful for further analysis or troubleshooting.
+ Once the **ethtool** output has been collected and stored, move on to the
+ next diagnostic step.
+
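When recording the `ethtool` output, the **Link Down Events** counter is easy
to track in a script. A minimal sketch, assuming the counter appears as a
``Link Down Events: <number>`` line as in the example outputs above; the helper
name ``link_down_events`` is hypothetical.

```shell
# Hypothetical helper: read `ethtool -I <interface>` output on stdin and
# print the numeric value of the "Link Down Events" line, if present.
link_down_events() {
    sed -n 's/^[[:space:]]*Link Down Events:[[:space:]]*\([0-9][0-9]*\).*$/\1/p'
}
```

For example, run ``ethtool -I <interface> | link_down_events`` twice, some time
apart. A counter that increases between the two samples suggests that the link
dropped in the meantime.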
+Check Power Delivery (PoDL or PoE)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If it is known that **PoDL** or **PoE** is **not implemented** on the system,
+or the **PSE** (Power Sourcing Equipment) is managed by proprietary user-space
+software or external tools, you can skip this step. In such cases, verify power
+delivery through alternative methods, such as checking hardware indicators
+(LEDs), using multimeters, or consulting vendor-specific software for
+monitoring power status.
+
+If **PoDL** or **PoE** is implemented and managed directly by Linux, follow
+these steps to ensure power is being delivered correctly:
+
+- **Command:** `ethtool --show-pse <interface>`
+
+- **Expected Output Examples**:
+
+ 1. **PSE Not Supported**:
+
+ If no PSE is attached or the interface does not support PSE, the following
+ output is expected:
+
+ .. code-block:: bash
+
+ netlink error: No PSE is attached
+ netlink error: Operation not supported
+
+ 2. **PoDL (Single-Pair Ethernet)**:
+
+ When PoDL is implemented, you might see the following attributes:
+
+ .. code-block:: bash
+
+ PSE attributes for eth1:
+ PoDL PSE Admin State: enabled
+ PoDL PSE Power Detection Status: delivering power
+
+ 3. **PoE (Clause 33 PSE)**:
+
+ For standard PoE, the output may look like this:
+
+ .. code-block:: bash
+
+ PSE attributes for eth1:
+ Clause 33 PSE Admin State: enabled
+ Clause 33 PSE Power Detection Status: delivering power
+ Clause 33 PSE Available Power Limit: 18000
+
+- **Adjust Power Limit (if needed)**:
+
+ - Sometimes, the available power limit may not be sufficient for the link
+ partner. You can increase the power limit (given in milliwatts) as needed.
+
+ - **Command:** `ethtool --set-pse <interface> c33-pse-avail-pw-limit <limit>`
+
+ Example:
+
+ .. code-block:: bash
+
+ ethtool --set-pse eth1 c33-pse-avail-pw-limit 18000
+ ethtool --show-pse eth1
+
+ **Expected Output** after adjusting the power limit:
+
+ .. code-block:: bash
+
+ Clause 33 PSE Available Power Limit: 18000
+
+
+- **Next Steps**:
+
+ - **PoE or PoDL Not Used**: If **PoE** or **PoDL** is not implemented or used
+ on the system, proceed to the next diagnostic step, as power delivery is
+ not relevant for this setup.
+
+ - **PoE or PoDL Controlled Externally**: If **PoE** or **PoDL** is used but
+ is not managed by the Linux kernel's **PSE-PD** framework (i.e., it is
+ controlled by proprietary user-space software or external tools), this part
+ is out of scope for this documentation. Please consult vendor-specific
+ documentation or external tools for monitoring and managing power delivery.
+
+ - **PSE Admin State Disabled**:
+
+ - If the `PSE Admin State:` is **disabled**, enable it by running one of
+ the following commands:
+
+ .. code-block:: bash
+
+ ethtool --set-pse <interface> podl-pse-admin-control enable
+
+ or, for Clause 33 PSE (PoE):
+
+ .. code-block:: bash
+
+ ethtool --set-pse <interface> c33-pse-admin-control enable
+
+ - After enabling the PSE Admin State, return to the start of the **Check
+ Power Delivery (PoDL or PoE)** step to recheck the power delivery status.
+
+ - **Power Not Delivered**: If the `Power Detection Status` shows something
+ other than "delivering power" (e.g., `over current`), troubleshoot the
+ **PSE**. Check for potential issues such as a short circuit in the cable,
+ insufficient power delivery, or a fault in the PSE itself.
+
+ - **Power Delivered but No Link**: If power is being delivered but no link is
+ established, proceed with further diagnostics by performing **Cable
+ Diagnostics** or reviewing the **Inspect Link Status and PHY
+ Configuration** steps to identify any underlying issues with the physical
+ link or settings.
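The decision steps above can be sketched as a small classifier over the `ethtool --show-pse` output. This is an illustration only: the function name `pse_state` and the four result labels are hypothetical, and the matched strings are the ones shown in the example outputs of this section.

```shell
# Classify `ethtool --show-pse <interface>` output read on stdin.
# Prints one of: no-pse, admin-disabled, delivering, not-delivering.
pse_state() {
    awk '
        /netlink error/               { err = 1 }
        /PSE Admin State:/            { attrs = 1; if ($NF == "disabled") disabled = 1 }
        /PSE Power Detection Status:/ { attrs = 1; if ($0 ~ /: delivering power *$/) ok = 1 }
        END {
            if (err || !attrs)  print "no-pse"
            else if (disabled)  print "admin-disabled"
            else if (ok)        print "delivering"
            else                print "not-delivering"
        }'
}

# Usage on a live system (eth1 is a placeholder); errors are reported on
# stderr, so it is merged into the pipe:
#   ethtool --show-pse eth1 2>&1 | pse_state
```

A result of `admin-disabled` corresponds to the "PSE Admin State Disabled" step, and `not-delivering` to the "Power Not Delivered" step.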
+
+Cable Diagnostics
+~~~~~~~~~~~~~~~~~
+
+Use `ethtool` to test for physical layer issues such as cable faults. The test
+results can vary depending on the cable's condition, the technology in use, and
+the state of the link partner. The results from the cable test will help in
+diagnosing issues like open circuits, shorts, impedance mismatches, and
+noise-related problems.
+
+- **Command:** `ethtool --cable-test <interface>`
+
+The following are the typical outputs for **Single-Pair Ethernet (SPE)** and
+**Multi-Pair Ethernet (MPE)**:
+
+- **For Single-Pair Ethernet (SPE)**:
+
+ - **Expected Output (SPE)**:
+
+ .. code-block:: bash
+
+ Cable test completed for device eth1.
+ Pair A, fault length: 25.00m
+ Pair A code Open Circuit
+
+ This indicates an open circuit or cable fault at the reported distance, but
+ results can be influenced by the link partner's state. Refer to the
+ **"Troubleshooting Based on Cable Test Results"** section for further
+ interpretation of these results.
+
+- **For Multi-Pair Ethernet (MPE)**:
+
+ - **Expected Output (MPE)**:
+
+ .. code-block:: bash
+
+ Cable test completed for device eth0.
+ Pair A code OK
+ Pair B code OK
+ Pair C code Open Circuit
+
+ Here, Pair C is reported as having an open circuit, while Pairs A and B are
+ functioning correctly. However, if autonegotiation is in use on Pairs A and
+ B, the cable test may be disrupted. Refer to the **"Troubleshooting Based on
+ Cable Test Results"** section for a detailed explanation of these issues and
+ how to resolve them.
+
+For detailed descriptions of the different possible cable test results, please
+refer to the **"Troubleshooting Based on Cable Test Results"** section.
+
+Troubleshooting Based on Cable Test Results
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After running the cable test, the results can help identify specific issues in
+the physical connection. However, it is important to note that **cable testing
+results heavily depend on the capabilities and characteristics of both the
+local hardware and the link partner**. The accuracy and reliability of the
+results can vary significantly between different hardware implementations.
+
+In some cases, this can introduce **blind spots** in the current cable testing
+implementation, where certain results may not accurately reflect the actual
+physical state of the cable. For example:
+
+- An **Open Circuit** result might not only indicate a damaged or disconnected
+ cable but also occur if the cable is properly attached to a powered-down link
+ partner.
+
+- Some PHYs may report a **Short within Pair** if the link partner is in
+ **forced slave mode**, even though there is no actual short in the cable.
+
+To help users interpret the results more effectively, it could be beneficial to
+extend the **kernel UAPI** (User API) to provide additional context or
+**possible variants** of issues based on the hardware’s characteristics. Since
+these quirks are often hardware-specific, the **kernel driver** would be an
+ideal source of such information. By providing flags or hints related to
+potential false positives for each test result, users would have a better
+understanding of what to verify and where to investigate further.
+
+Until such improvements are made, users should be aware of these limitations
+and manually verify cable issues as needed. Physical inspections may help
+resolve uncertainties related to false positive results.
+
+The results can be one of the following:
+
+- **OK**:
+
+ - The cable is functioning correctly, and no issues were detected.
+
+ - **Next Steps**: If you are still experiencing issues, it might be related
+ to higher-layer problems, such as duplex mismatches or speed negotiation,
+ which are not physical-layer issues.
+
+ - **Special Case for `BaseT1` (1000/100/10BaseT1)**: In `BaseT1` systems, an
+ "OK" result typically also means that the link is up and likely in **slave
+ mode**, since cable tests usually only pass in this mode. For some
+ **10BaseT1L** PHYs, an "OK" result may occur even if the cable is too long
+ for the PHY's configured range (for example, when the range is configured
+ for short-distance mode).
+
+- **Open Circuit**:
+
+ - An **Open Circuit** result typically indicates that the cable is damaged or
+ disconnected at the reported fault length. Consider these possibilities:
+
+ - If the link partner is in **admin down** state or powered off, you might
+ still get an "Open Circuit" result even if the cable is functional.
+
+ - **Next Steps**: Inspect the cable at the fault length for visible damage
+ or loose connections. Verify the link partner is powered on and in the
+ correct mode.
+
+- **Short within Pair**:
+
+ - A **Short within Pair** indicates an unintended connection within the same
+ pair of wires, typically caused by physical damage to the cable.
+
+ - **Next Steps**: Replace or repair the cable and check for any physical
+ damage or improperly crimped connectors.
+
+- **Short to Another Pair**:
+
+ - A **Short to Another Pair** means the wires from different pairs are
+ shorted, which could occur due to physical damage or incorrect wiring.
+
+ - **Next Steps**: Replace or repair the damaged cable. Inspect the cable for
+ incorrect terminations or pinched wiring.
+
+- **Impedance Mismatch**:
+
+ - **Impedance Mismatch** indicates a reflection caused by an impedance
+ discontinuity in the cable. This can happen when a part of the cable has
+ abnormal impedance (e.g., when different cable types are spliced together
+ or when there is a defect in the cable).
+
+ - **Next Steps**: Check the cable quality and ensure consistent impedance
+ throughout its length. Replace any sections of the cable that do not meet
+ specifications.
+
+- **Noise**:
+
+ - **Noise** means that the Time Domain Reflectometry (TDR) test could not
+ complete due to excessive noise on the cable, which can be caused by
+ interference from electromagnetic sources.
+
+ - **Next Steps**: Identify and eliminate sources of electromagnetic
+ interference (EMI) near the cable. Consider using shielded cables or
+ rerouting the cable away from noise sources.
+
+- **Resolution Not Possible**:
+
+ - **Resolution Not Possible** means that the TDR test could not detect the
+ issue due to the resolution limitations of the test or because the fault is
+ beyond the distance that the test can measure.
+
+ - **Next Steps**: Inspect the cable manually if possible, or use alternative
+ diagnostic tools that can handle greater distances or higher resolution.
+
+- **Unknown**:
+
+ - An **Unknown** result may occur when the test cannot classify the fault or
+ when a specific issue is outside the scope of the tool's detection
+ capabilities.
+
+ - **Next Steps**: Re-run the test, verify the link partner's state, and inspect
+ the cable manually if necessary.
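As a rough illustration of the result list above, the following sketch attaches a short hint to every per-pair result line. The function name `cable_hints` and the hint wording are hypothetical; the result names are taken from the headings of this section and are matched case-insensitively, since the exact spelling printed by a given `ethtool` version is an assumption here.

```shell
# Read `ethtool --cable-test <interface>` output on stdin and print a
# one-line hint for every "Pair X code <result>" line.
cable_hints() {
    sed -n -E 's/^[[:space:]]*Pair ([A-D]) code (.*[^[:space:]])[[:space:]]*$/\1|\2/p' |
    while IFS='|' read -r pair code; do
        # Lower-case the result so that spelling variants still match.
        case "$(printf '%s' "$code" | tr '[:upper:]' '[:lower:]')" in
            ok)                        hint="no fault detected" ;;
            "open circuit")            hint="inspect cable; check that the link partner is powered on" ;;
            "short within pair")       hint="inspect cable and connectors; check for a forced-slave link partner" ;;
            "short to another pair")   hint="inspect cable for damage or incorrect wiring" ;;
            "impedance mismatch")      hint="check cable quality and splices" ;;
            noise)                     hint="look for EMI sources near the cable" ;;
            "resolution not possible") hint="inspect manually or use another tool" ;;
            *)                         hint="re-run the test and verify the link partner" ;;
        esac
        printf 'Pair %s: %s (%s)\n' "$pair" "$code" "$hint"
    done
}

# Usage on a live system (eth0 is a placeholder):
#   ethtool --cable-test eth0 | cable_hints
```

The hints only restate the false-positive cases described above; they do not replace a physical inspection.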
+
+Verify Link Partner PHY Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the cable test passes but the link is still not functioning correctly, it’s
+essential to verify the configuration of the link partner’s PHY. Mismatches in
+speed, duplex settings, or master-slave roles can cause connection issues.
+
+Autonegotiation Mismatch
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+- If both link partners support autonegotiation, ensure that autonegotiation is
+ enabled on both sides and that all supported link modes are advertised. A
+ mismatch can lead to connectivity problems or suboptimal performance.
+
+- **Quick Fix:** Reset autonegotiation to the default settings, which will
+ advertise all default link modes:
+
+ .. code-block:: bash
+
+ ethtool -s <interface> autoneg on
+
+- **Command to check configuration:** `ethtool <interface>`
+
+- **Expected Output:** Ensure that both sides advertise compatible link modes.
+ If autonegotiation is off, verify that both link partners are configured for
+ the same speed and duplex.
+
+ The following example shows a case where the local PHY advertises fewer link
+ modes than it supports. This will reduce the number of overlapping link modes
+ with the link partner. In the worst case, there will be no common link modes,
+ and the link will not be created:
+
+ .. code-block:: bash
+
+ Settings for eth0:
+ Supported link modes: 1000baseT/Full, 100baseT/Full
+ Advertised link modes: 1000baseT/Full
+ Speed: 1000Mb/s
+ Duplex: Full
+ Auto-negotiation: on
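A minimal sketch of how the comparison in the example above could be automated: it lists link modes that are supported but not advertised. The function name `unadvertised_modes` is hypothetical, and the parser assumes the "Supported link modes:" and "Advertised link modes:" labels shown above, with modes separated by spaces or commas and possibly continued on following lines.

```shell
# Read `ethtool <interface>` output on stdin and print every link mode that
# appears under "Supported link modes" but not under "Advertised link modes".
unadvertised_modes() {
    awk '
        {
            if ($0 ~ /^[ \t]*Supported link modes:/)       { sect = "sup"; sub(/.*:/, "") }
            else if ($0 ~ /^[ \t]*Advertised link modes:/) { sect = "adv"; sub(/.*:/, "") }
            else if ($0 ~ /:/)                             { sect = "" }  # any other field ends the list
            gsub(/,/, " ")
            if (sect == "sup") for (i = 1; i <= NF; i++) sup[$i] = 1
            if (sect == "adv") for (i = 1; i <= NF; i++) adv[$i] = 1
        }
        END { for (m in sup) if (!(m in adv)) print m }' |
        LC_ALL=C sort
}

# Usage on a live system (eth0 is a placeholder):
#   ethtool eth0 | unadvertised_modes
```

An empty result means that every supported mode is advertised. A value such as "Not reported" is not treated specially and would show up as plain words.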
+
+Combined Mode Mismatch (Autonegotiation on One Side, Forced on the Other)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- One possible issue occurs when one side is using **autonegotiation** (as in
+ most modern systems), and the other side is set to a **forced link mode**
+ (e.g., older hardware with single-speed hubs). In such cases, modern PHYs
+ will attempt to detect the forced mode on the other side. If the link is
+ established, you may notice:
+
+ - **No or empty "Link partner advertised link modes"**.
+
+ - **"Link partner advertised auto-negotiation:"** will be **"no"** or not
+ present.
+
+- This type of detection does not always work reliably:
+
+ - Typically, the modern PHY will default to **Half Duplex**, even if the link
+ partner is actually configured for **Full Duplex**.
+
+ - Some PHYs may not work reliably if the link partner switches from one
+ forced mode to another. In this case, only a down/up cycle may help.
+
+- **Next Steps**: Set both sides to the same fixed speed and duplex mode to
+ avoid potential detection issues.
+
+ .. code-block:: bash
+
+ ethtool -s <interface> speed 1000 duplex full autoneg off
+
+Master/Slave Role Mismatch (BaseT1 and 1000BaseT PHYs)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- In **BaseT1** systems (e.g., 1000BaseT1, 100BaseT1), link establishment
+ requires that one device is configured as **master** and the other as
+ **slave**. A mismatch in this master-slave configuration can prevent the link
+ from being established. However, **1000BaseT** also supports configurable
+ master/slave roles and can face similar issues.
+
+- **Role Preference in 1000BaseT**: The **1000BaseT** specification allows link
+ partners to negotiate master-slave roles or role preferences during
+ autonegotiation. Some PHYs have hardware limitations or bugs that prevent
+ them from functioning properly in certain roles. In such cases, drivers may
+ force these PHYs into a specific role (e.g., **forced master** or **forced
+ slave**) or try a weaker option by setting preferences. If both link partners
+ have the same issue and are forced into the same mode (e.g., both forced into
+ master mode), they will not be able to establish a link.
+
+- **Next Steps**: Ensure that one side is configured as **master** and the
+ other as **slave** to avoid this issue, particularly when hardware
+ limitations are involved, or try the weaker **preferred** option instead of
+ **forced**. Check for any driver-related restrictions or forced modes.
+
+- **Command to force master/slave mode**:
+
+ .. code-block:: bash
+
+ ethtool -s <interface> master-slave forced-master
+
+ or:
+
+ .. code-block:: bash
+
+ ethtool -s <interface> master-slave forced-master speed 1000 duplex full autoneg off
+
+
+- **Check the current master/slave status**:
+
+ .. code-block:: bash
+
+ ethtool <interface>
+
+ Example Output:
+
+ .. code-block:: bash
+
+ master-slave cfg: forced-master
+ master-slave status: master
+
+- **Hardware Bugs and Driver Forcing**: If a known hardware issue forces the
+ PHY into a specific mode, it’s essential to check the driver source code or
+ hardware documentation for details. Ensure that the roles are compatible
+ across both link partners, and if both PHYs are forced into the same mode,
+ adjust one side accordingly to resolve the mismatch.
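The conflict described above can be expressed as a small check over the `master-slave cfg` values of both link partners. This is a sketch: the function name `roles_conflict` is hypothetical, and how the value of the remote side is obtained is left open.

```shell
# Succeeds (exit status 0) if the two "master-slave cfg" values cannot form
# a link because both sides are forced into the same role. Preferred roles
# never conflict, since they are resolved during autonegotiation.
roles_conflict() {
    case "$1/$2" in
        forced-master/forced-master|forced-slave/forced-slave) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage with values copied from both link partners:
#   if roles_conflict forced-master forced-master; then
#       echo "role mismatch: change one side"
#   fi
```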
+
+Monitor Link Resets and Speed Drops
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the link is unstable, showing frequent resets or speed drops, this may
+indicate issues with the cable, PHY configuration, or environmental factors.
+While there is still no completely unified way in Linux to directly monitor
+downshift events or link speed changes via user space tools, both the Linux
+kernel logs and `ethtool` can provide valuable insights, especially if the
+driver supports reporting such events.
+
+- **Monitor Kernel Logs for Link Resets and Speed Drops**:
+
+ - The Linux kernel will print link status changes, including downshift
+ events, in the system logs. These messages typically include speed changes,
+ duplex mode, and downshifted link speed (if the driver supports it).
+
+ - **Command to monitor kernel logs in real-time:**
+
+ .. code-block:: bash
+
+ dmesg -w | grep "Link is Up\|Link is Down"
+
+ - Example Output (if a downshift occurs):
+
+ .. code-block:: bash
+
+ eth0: Link is Up - 100Mbps/Full (downshifted) - flow control rx/tx
+ eth0: Link is Down
+
+ This indicates that the link has been established but has downshifted from
+ a higher speed.
+
+ - **Note**: Not all drivers or PHYs support downshift reporting, so you may
+ not see this information for all devices.
+
+- **Monitor Link Down Events Using `ethtool`**:
+
+ - With recent kernel and `ethtool` versions, you can track **Link Down
+ Events** using the `ethtool -I` command. If the driver supports it, the
+ output includes a counter of link drops, which helps to diagnose link
+ instability.
+
+ - **Command to monitor link down events:**
+
+ .. code-block:: bash
+
+ ethtool -I <interface>
+
+ - Example Output (if supported):
+
+ .. code-block:: bash
+
+ Settings for eth1:
+ Link Down Events: 5
+
+ This indicates that the link has dropped 5 times. Frequent link down events
+ may indicate cable or environmental issues that require further
+ investigation.
+
+- **Check Link Status and Speed**:
+
+ - Even though downshift counts or events are not easily tracked, you can
+ still use `ethtool` to manually check the current link speed and status.
+
+ - **Command:** `ethtool <interface>`
+
+ - **Expected Output:**
+
+ .. code-block:: bash
+
+ Speed: 1000Mb/s
+ Duplex: Full
+ Auto-negotiation: on
+ Link detected: yes
+
+ Any inconsistencies in the expected speed or duplex setting could indicate
+ an issue.
+
+- **Disable Energy-Efficient Ethernet (EEE) for Diagnostics**:
+
+ - **EEE** (Energy-Efficient Ethernet) can be a source of link instability due
+ to transitions in and out of low-power states. For diagnostic purposes, it
+ may be useful to **temporarily** disable EEE to determine if it is
+ contributing to link instability. This is **not a generic recommendation**
+ for disabling power management.
+
+ - **Next Steps**: Disable EEE and monitor whether the link becomes stable.
+
+ - **Command:**
+
+ .. code-block:: bash
+
+ ethtool --set-eee <interface> eee off
+
+ - **Important**: If disabling EEE resolves the instability, the issue should
+ be reported to the maintainers as a bug, and the driver should be corrected
+ to handle EEE properly without causing instability. Disabling EEE
+ permanently should not be seen as a solution.
+
+- **Monitor Error Counters**:
+
+ - Use `ethtool -S <interface> --all-groups` to retrieve standardized interface
+ statistics if the driver supports the unified interface:
+
+ - **Command:** `ethtool -S <interface> --all-groups`
+
+ - **Example Output (if supported)**:
+
+ .. code-block:: bash
+
+ phydev-RxFrames: 100391
+ phydev-RxErrors: 0
+ phydev-TxFrames: 9
+ phydev-TxErrors: 0
+
+ - If the unified interface is not supported, use `ethtool -S <interface>` to
+ retrieve MAC and PHY counters. Note that non-standardized PHY counter names
+ vary by driver and must be interpreted accordingly:
+
+ - **Command:** `ethtool -S <interface>`
+
+ - **Example Output (if supported)**:
+
+ .. code-block:: bash
+
+ rx_crc_errors: 123
+ tx_errors: 45
+ rx_frame_errors: 78
+
+ - **Note**: If no meaningful error counters are available or if counters are
+ not supported, you may need to rely on physical inspections (e.g., cable
+ condition) or kernel log messages (e.g., link up/down events) to further
+ diagnose the issue.
+
+ - **Compare Counters**:
+
+ - Compare the egress and ingress frame counts reported by the PHY and MAC.
+
+ - A small difference may occur due to sampling rate differences between the
+ MAC and PHY drivers, or if the PHY and MAC are not always fully
+ synchronized in their UP or DOWN states.
+
+ - Significant discrepancies indicate potential issues in the data path
+ between the MAC and PHY.
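To put numbers on the kernel log messages described above, a captured log can be summarized as follows. This is a sketch: the function name `link_flaps` is hypothetical, and the matched strings are the ones from the example log output ("Link is Up", "Link is Down", "downshifted"), which not every driver prints.

```shell
# Count link up, link down and downshift messages in kernel log text read
# on stdin and print a one-line summary.
link_flaps() {
    awk '
        /Link is Down/            { down++ }
        /Link is Up/              { up++ }
        /Link is Up.*downshifted/ { shifted++ }
        END { printf "up=%d down=%d downshifted=%d\n", up, down, shifted }'
}

# Usage on a live system (eth0 is a placeholder):
#   dmesg | grep 'eth0' | link_flaps
```

A high `down` count over a short period points to the same instability that the Link Down Events counter reports.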
+
+When All Else Fails...
+~~~~~~~~~~~~~~~~~~~~~~
+
+So you've checked the cables, monitored the logs, disabled EEE, and still...
+nothing? Don’t worry, you’re not alone. Sometimes, Ethernet gremlins just don’t
+want to cooperate.
+
+But before you throw in the towel (or the Ethernet cable), take a deep breath.
+It’s always possible that:
+
+1. Your PHY has a unique, undocumented personality.
+
+2. The problem is lying dormant, waiting for just the right moment to magically
+ resolve itself (hey, it happens!).
+
+3. Or, it could be that the ultimate solution simply hasn’t been invented yet.
+
+If none of the above bring you comfort, there’s one final step: contribute! If
+you've uncovered new or unusual issues, or have creative diagnostic methods,
+feel free to share your findings and extend this documentation. Together, we
+can hunt down every elusive network issue - one twisted pair at a time.
+
+Remember: sometimes the solution is just a reboot away, but if not, it’s time to
+dig deeper - or report that bug!
+
diff --git a/Documentation/networking/dns_resolver.rst b/Documentation/networking/dns_resolver.rst
index add4d59a99a5..c0364f7070af 100644
--- a/Documentation/networking/dns_resolver.rst
+++ b/Documentation/networking/dns_resolver.rst
@@ -118,7 +118,7 @@ Keys of dns_resolver type can be read from userspace using keyctl_read() or
Mechanism
=========
-The dnsresolver module registers a key type called "dns_resolver". Keys of
+The dns_resolver module registers a key type called "dns_resolver". Keys of
this type are used to transport and cache DNS lookup results from userspace.
When dns_query() is invoked, it calls request_key() to search the local
@@ -152,4 +152,4 @@ Debugging
Debugging messages can be turned on dynamically by writing a 1 into the
following file::
- /sys/module/dnsresolver/parameters/debug
+ /sys/module/dns_resolver/parameters/debug
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index d583d9abf2f8..3770a2294509 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -57,6 +57,7 @@ Structure of this header is
``ETHTOOL_A_HEADER_DEV_INDEX`` u32 device ifindex
``ETHTOOL_A_HEADER_DEV_NAME`` string device name
``ETHTOOL_A_HEADER_FLAGS`` u32 flags common for all requests
+ ``ETHTOOL_A_HEADER_PHY_INDEX`` u32 phy device index
============================== ====== =============================
``ETHTOOL_A_HEADER_DEV_INDEX`` and ``ETHTOOL_A_HEADER_DEV_NAME`` identify the
@@ -81,6 +82,12 @@ the behaviour is backward compatible, i.e. requests from old clients not aware
of the flag should be interpreted the way the client expects. A client must
not set flags it does not understand.
+``ETHTOOL_A_HEADER_PHY_INDEX`` identifies the Ethernet PHY the message relates to.
+As there are numerous commands that are related to PHY configuration, and because
+there may be more than one PHY on the link, the PHY index can be passed in the
+request for the commands that need it. It is, however, not mandatory, and if it
+is not passed for commands that target a PHY, the net_device.phydev pointer
+is used.
Bit sets
========
@@ -228,6 +235,10 @@ Userspace to kernel:
``ETHTOOL_MSG_PLCA_GET_STATUS`` get PLCA RS status
``ETHTOOL_MSG_MM_GET`` get MAC merge layer state
``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters
+ ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT`` flash transceiver module firmware
+ ``ETHTOOL_MSG_PHY_GET`` get Ethernet PHY information
+ ``ETHTOOL_MSG_TSCONFIG_GET`` get hw timestamping configuration
+ ``ETHTOOL_MSG_TSCONFIG_SET`` set hw timestamping configuration
===================================== =================================
Kernel to userspace:
@@ -274,6 +285,11 @@ Kernel to userspace:
``ETHTOOL_MSG_PLCA_GET_STATUS_REPLY`` PLCA RS status
``ETHTOOL_MSG_PLCA_NTF`` PLCA RS parameters
``ETHTOOL_MSG_MM_GET_REPLY`` MAC merge layer status
+ ``ETHTOOL_MSG_MODULE_FW_FLASH_NTF`` transceiver module flash updates
+ ``ETHTOOL_MSG_PHY_GET_REPLY`` Ethernet PHY information
+ ``ETHTOOL_MSG_PHY_NTF`` Ethernet PHY information change
+ ``ETHTOOL_MSG_TSCONFIG_GET_REPLY`` hw timestamping configuration
+ ``ETHTOOL_MSG_TSCONFIG_SET_REPLY`` new hw timestamping configuration
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@@ -883,6 +899,10 @@ Kernel response contents:
``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer
``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX`` u32 max size of TX push buffer
+ ``ETHTOOL_A_RINGS_HDS_THRESH`` u32 threshold of
+ header / data split
+ ``ETHTOOL_A_RINGS_HDS_THRESH_MAX`` u32 max threshold of
+ header / data split
======================================= ====== ===========================
``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
@@ -925,25 +945,31 @@ Request contents:
``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer
+ ``ETHTOOL_A_RINGS_HDS_THRESH`` u32 threshold of header / data split
==================================== ====== ===========================
Kernel checks that requested ring sizes do not exceed limits reported by
-driver. Driver may impose additional constraints and may not suspport all
+driver. Driver may impose additional constraints and may not support all
attributes.
``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
-Completion queue events(CQE) are the events posted by NIC to indicate the
-completion status of a packet when the packet is sent(like send success or
-error) or received(like pointers to packet fragments). The CQE size parameter
+Completion queue events (CQE) are the events posted by NIC to indicate the
+completion status of a packet when the packet is sent (like send success or
+error) or received (like pointers to packet fragments). The CQE size parameter
enables to modify the CQE size other than default size if NIC supports it.
-A bigger CQE can have more receive buffer pointers inturn NIC can transfer
-a bigger frame from wire. Based on the NIC hardware, the overall completion
-queue size can be adjusted in the driver if CQE size is modified.
+A bigger CQE can have more receive buffer pointers, and in turn the NIC can
+transfer a bigger frame from wire. Based on the NIC hardware, the overall
+completion queue size can be adjusted in the driver if CQE size is modified.
+
+``ETHTOOL_A_RINGS_HDS_THRESH`` specifies the threshold value of the
+header / data split feature. If the size of a received packet is larger than
+this threshold value, header and data will be split.
CHANNELS_GET
============
@@ -987,7 +1013,7 @@ Request contents:
===================================== ====== ==========================
Kernel checks that requested channel counts do not exceed limits reported by
-driver. Driver may impose additional constraints and may not suspport all
+driver. Driver may impose additional constraints and may not support all
attributes.
@@ -1033,6 +1059,8 @@ Kernel response contents:
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
+ ``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
+ ``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
=========================================== ====== =======================
Attributes are only included in reply if their value is not zero or the
@@ -1062,6 +1090,10 @@ block should be sent.
This feature is mainly of interest for specific USB devices which does not cope
well with frequent small-sized URBs transmissions.
+``ETHTOOL_A_COALESCE_RX_PROFILE`` and ``ETHTOOL_A_COALESCE_TX_PROFILE`` refer
+to DIM parameters, see `Generic Network Dynamic Interrupt Moderation (Net DIM)
+<https://www.kernel.org/doc/Documentation/networking/net_dim.rst>`_.
+
COALESCE_SET
============
@@ -1098,6 +1130,8 @@ Request contents:
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES`` u32 max aggr size, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES`` u32 max aggr packets, Tx
``ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS`` u32 time (us), aggr, Tx
+ ``ETHTOOL_A_COALESCE_RX_PROFILE`` nested profile of DIM, Rx
+ ``ETHTOOL_A_COALESCE_TX_PROFILE`` nested profile of DIM, Tx
=========================================== ====== =======================
Request is rejected if it attributes declared as unsupported by driver (i.e.
@@ -1225,9 +1259,10 @@ Gets timestamping information like ``ETHTOOL_GET_TS_INFO`` ioctl request.
Request contents:
- ===================================== ====== ==========================
- ``ETHTOOL_A_TSINFO_HEADER`` nested request header
- ===================================== ====== ==========================
+ ======================================== ====== ============================
+ ``ETHTOOL_A_TSINFO_HEADER`` nested request header
+ ``ETHTOOL_A_TSINFO_HWTSTAMP_PROVIDER`` nested PTP hw clock provider
+ ======================================== ====== ============================
Kernel response contents:
@@ -1237,12 +1272,27 @@ Kernel response contents:
``ETHTOOL_A_TSINFO_TX_TYPES`` bitset supported Tx types
``ETHTOOL_A_TSINFO_RX_FILTERS`` bitset supported Rx filters
``ETHTOOL_A_TSINFO_PHC_INDEX`` u32 PTP hw clock index
+ ``ETHTOOL_A_TSINFO_STATS`` nested HW timestamping statistics
===================================== ====== ==========================
``ETHTOOL_A_TSINFO_PHC_INDEX`` is absent if there is no associated PHC (there
is no special value for this case). The bitset attributes are omitted if they
would be empty (no bit set).
+Additional hardware timestamping statistics response contents:
+
+ ================================================== ====== =====================
+ ``ETHTOOL_A_TS_STAT_TX_PKTS`` uint Packets with Tx
+ HW timestamps
+ ``ETHTOOL_A_TS_STAT_TX_LOST`` uint Tx HW timestamp
+ not arrived count
+ ``ETHTOOL_A_TS_STAT_TX_ERR`` uint HW error request
+ Tx timestamp count
+ ``ETHTOOL_A_TS_STAT_TX_ONESTEP_PKTS_UNCONFIRMED`` uint Packets with one-step
+ HW TX timestamps with
+ unconfirmed delivery
+ ================================================== ====== =====================
+
CABLE_TEST
==========
@@ -1288,12 +1338,17 @@ information.
+-+-+-----------------------------------------+--------+---------------------+
| | | ``ETHTOOL_A_CABLE_RESULTS_CODE`` | u8 | result code |
+-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_RESULT_SRC`` | u32 | information source |
+ +-+-+-----------------------------------------+--------+---------------------+
| | ``ETHTOOL_A_CABLE_NEST_FAULT_LENGTH`` | nested | cable length |
+-+-+-----------------------------------------+--------+---------------------+
| | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_PAIR`` | u8 | pair number |
+-+-+-----------------------------------------+--------+---------------------+
| | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_CM`` | u32 | length in cm |
+-+-+-----------------------------------------+--------+---------------------+
+ | | | ``ETHTOOL_A_CABLE_FAULT_LENGTH_SRC`` | u32 | information source |
+ +-+-+-----------------------------------------+--------+---------------------+
+
CABLE_TEST TDR
==============
@@ -1577,6 +1632,7 @@ the ``ETHTOOL_A_STATS_GROUPS`` bitset. Currently defined values are:
ETHTOOL_STATS_ETH_PHY eth-phy Basic IEEE 802.3 PHY statistics (30.3.2.1.*)
ETHTOOL_STATS_ETH_CTRL eth-ctrl Basic IEEE 802.3 MAC Ctrl statistics (30.3.3.*)
ETHTOOL_STATS_RMON rmon RMON (RFC 2819) statistics
+ ETHTOOL_STATS_PHY phy Additional PHY statistics, not defined by IEEE
====================== ======== ===============================================
Each group should have a corresponding ``ETHTOOL_A_STATS_GRP`` in the reply.
@@ -1711,32 +1767,100 @@ Request contents:
Kernel response contents:
- ====================================== ====== =============================
- ``ETHTOOL_A_PSE_HEADER`` nested reply header
- ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
- PSE functions
- ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
- PoDL PSE.
- ====================================== ====== =============================
+ ========================================== ====== =============================
+ ``ETHTOOL_A_PSE_HEADER`` nested reply header
+ ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` u32 Operational state of the PoDL
+ PSE functions
+ ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoDL PSE.
+ ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` u32 Operational state of the PoE
+ PSE functions.
+ ``ETHTOOL_A_C33_PSE_PW_D_STATUS`` u32 power detection status of the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_PW_CLASS`` u32 power class of the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` u32 actual power drawn on the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_EXT_STATE`` u32 power extended state of the
+ PoE PSE.
+ ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` u32 power extended substatus of
+ the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` u32 currently configured power
+ limit of the PoE PSE.
+ ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested Supported power limit
+ configuration ranges.
+ ========================================== ====== =============================
When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies
the operational state of the PoDL PSE functions. The operational state of the
PSE function can be changed using the ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL``
-action. This option is corresponding to ``IEEE 802.3-2018`` 30.15.1.1.2
+action. This attribute corresponds to ``IEEE 802.3-2018`` 30.15.1.1.2
aPoDLPSEAdminState. Possible values are:
.. kernel-doc:: include/uapi/linux/ethtool.h
:identifiers: ethtool_podl_pse_admin_state
+The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_STATE`` implementing
+``IEEE 802.3-2022`` 30.9.1.1.2 aPSEAdminState.
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_c33_pse_admin_state
+
When set, the optional ``ETHTOOL_A_PODL_PSE_PW_D_STATUS`` attribute identifies
the power detection status of the PoDL PSE. The status depends on the internal PSE
-state machine and automatic PD classification support. This option is
-corresponding to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus.
+state machine and automatic PD classification support. This attribute
+corresponds to ``IEEE 802.3-2018`` 30.15.1.1.3 aPoDLPSEPowerDetectionStatus.
Possible values are:
.. kernel-doc:: include/uapi/linux/ethtool.h
:identifiers: ethtool_podl_pse_pw_d_status
+The same goes for ``ETHTOOL_A_C33_PSE_PW_D_STATUS`` implementing
+``IEEE 802.3-2022`` 30.9.1.1.5 aPSEPowerDetectionStatus.
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_c33_pse_pw_d_status
+
+When set, the optional ``ETHTOOL_A_C33_PSE_PW_CLASS`` attribute identifies
+the power class of the C33 PSE. It depends on the class negotiated between
+the PSE and the PD. This attribute corresponds to ``IEEE 802.3-2022``
+30.9.1.1.8 aPSEPowerClassification.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_ACTUAL_PW`` attribute identifies
+the actual power drawn by the C33 PSE. This attribute corresponds to
+``IEEE 802.3-2022`` 30.9.1.1.23 aPSEActualPower. Actual power is reported
+in mW.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_EXT_STATE`` attribute identifies
+the extended error state of the C33 PSE. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_c33_pse_ext_state
+
+When set, the optional ``ETHTOOL_A_C33_PSE_EXT_SUBSTATE`` attribute identifies
+the extended error substate of the C33 PSE. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_c33_pse_ext_substate_class_num_events
+ ethtool_c33_pse_ext_substate_error_condition
+ ethtool_c33_pse_ext_substate_mr_pse_enable
+ ethtool_c33_pse_ext_substate_option_detect_ted
+ ethtool_c33_pse_ext_substate_option_vport_lim
+ ethtool_c33_pse_ext_substate_ovld_detected
+ ethtool_c33_pse_ext_substate_pd_dll_power_type
+ ethtool_c33_pse_ext_substate_power_not_available
+ ethtool_c33_pse_ext_substate_short_detected
+
+When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` attribute
+identifies the C33 PSE power limit in mW.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested attribute
+identifies the C33 PSE power limit ranges through
+``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MIN`` and
+``ETHTOOL_A_C33_PSE_PWR_VAL_LIMIT_RANGE_MAX``.
+If the controller works with fixed classes, the min and max values will be
+equal.
+
PSE_SET
=======
@@ -1747,13 +1871,31 @@ Request contents:
====================================== ====== =============================
``ETHTOOL_A_PSE_HEADER`` nested request header
``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` u32 Control PoDL PSE Admin state
+ ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` u32 Control PSE Admin state
+ ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` u32 Control PoE PSE available
+ power limit
====================================== ====== =============================
When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used
-to control PoDL PSE Admin functions. This option is implementing
+to control PoDL PSE Admin functions. This option implements
``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
+The same goes for ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` implementing
+``IEEE 802.3-2022`` 30.9.1.2.1 acPSEAdminControl.
+
+When set, the optional ``ETHTOOL_A_C33_PSE_AVAIL_PW_LIMIT`` attribute is
+used to control the available power value limit for C33 PSE in milliwatts.
+This attribute corresponds to the `pse_available_power` variable described in
+``IEEE 802.3-2022`` 33.2.4.4 Variables and `pse_avail_pwr` in 145.2.5.4
+Variables, which are described in power classes.
+
+It was decided to use milliwatts for this interface to unify it with other
+power monitoring interfaces, which also use milliwatts, and to align with
+various existing products that document power consumption in watts rather than
+classes. If power limit configuration based on classes is needed, the
+conversion can be done in user space, for example by ethtool.
+
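The user-space conversion between milliwatts and classes mentioned above could
look like the following minimal sketch. The helper names and the per-class
power table are assumptions made for illustration (the table lists commonly
quoted PSE output power values and should be checked against IEEE 802.3
before use); they are not part of ethtool or of the kernel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed PSE output power per class in mW; the array index is the class
 * number. Index 0 is a placeholder, class 0 is not handled in this sketch. */
static const uint32_t pse_class_mw[] = {
    0,     /* unused */
    4000,  /* class 1 */
    7000,  /* class 2 */
    15400, /* class 3 */
    30000, /* class 4 */
    45000, /* class 5 */
    60000, /* class 6 */
    75000, /* class 7 */
    90000, /* class 8 */
};

#define PSE_NUM_CLASSES (sizeof(pse_class_mw) / sizeof(pse_class_mw[0]))

/* Return the highest class whose power fits within limit_mw, or -1 if
 * even class 1 does not fit. */
int pse_class_for_limit_mw(uint32_t limit_mw)
{
    int best = -1;
    size_t cls;

    for (cls = 1; cls < PSE_NUM_CLASSES; cls++) {
        if (pse_class_mw[cls] <= limit_mw)
            best = (int)cls;
    }
    return best;
}

/* Return the power limit in mW to request for a class, or 0 if the class
 * is outside the table. */
uint32_t pse_limit_mw_for_class(int cls)
{
    if (cls < 1 || (size_t)cls >= PSE_NUM_CLASSES)
        return 0;
    return pse_class_mw[cls];
}
```

A tool would pass the result of ``pse_limit_mw_for_class()`` as the value of
the power limit attribute, and use ``pse_class_for_limit_mw()`` to present a
reported limit as a class.
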
RSS_GET
=======
@@ -1762,15 +1904,24 @@ RSS context of an interface similar to ``ETHTOOL_GRSSH`` ioctl request.
Request contents:
-===================================== ====== ==========================
+===================================== ====== ============================
``ETHTOOL_A_RSS_HEADER`` nested request header
``ETHTOOL_A_RSS_CONTEXT`` u32 context number
-===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_START_CONTEXT`` u32 start context number (dumps)
+===================================== ====== ============================
+
+``ETHTOOL_A_RSS_CONTEXT`` specifies which RSS context number to query;
+if not set, context 0 (the main context) is queried. Dumps can be filtered
+by device (only listing contexts of a given netdev). Filtering by a single
+context number is not supported, but ``ETHTOOL_A_RSS_START_CONTEXT``
+can be used to start dumping contexts from the given number (primarily
+used to ignore context 0 and only dump additional contexts).
Kernel response contents:
===================================== ====== ==========================
``ETHTOOL_A_RSS_HEADER`` nested reply header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
@@ -1822,7 +1973,7 @@ When set, the optional ``ETHTOOL_A_PLCA_VERSION`` attribute indicates which
standard and version the PLCA management interface complies to. When not set,
the interface is vendor-specific and (possibly) supplied by the driver.
The OPEN Alliance SIG specifies a standard register map for 10BASE-T1S PHYs
-embedding the PLCA Reconcialiation Sublayer. See "10BASE-T1S PLCA Management
+embedding the PLCA Reconciliation Sublayer. See "10BASE-T1S PLCA Management
Registers" at https://www.opensig.org/about/specifications/.
When set, the optional ``ETHTOOL_A_PLCA_ENABLED`` attribute indicates the
@@ -1884,7 +2035,7 @@ Request contents:
``ETHTOOL_A_PLCA_ENABLED`` u8 PLCA Admin State
``ETHTOOL_A_PLCA_NODE_ID`` u8 PLCA unique local node ID
``ETHTOOL_A_PLCA_NODE_CNT`` u8 Number of PLCA nodes on the
- netkork, including the
+ network, including the
coordinator
``ETHTOOL_A_PLCA_TO_TMR`` u8 Transmit Opportunity Timer
value in bit-times (BT)
@@ -2004,6 +2155,185 @@ The attributes are propagated to the driver through the following structure:
.. kernel-doc:: include/linux/ethtool.h
:identifiers: ethtool_mm_cfg
+MODULE_FW_FLASH_ACT
+===================
+
+Flashes transceiver module firmware.
+
+Request contents:
+
+ ======================================= ====== ===========================
+ ``ETHTOOL_A_MODULE_FW_FLASH_HEADER`` nested request header
+ ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` string firmware image file name
+ ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` u32 transceiver module password
+ ======================================= ====== ===========================
+
+The firmware update process consists of three logical steps:
+
+1. Downloading a firmware image to the transceiver module and validating it.
+2. Running the firmware image.
+3. Committing the firmware image so that it is run upon reset.
+
+When the flash command is given, those three steps are taken in that order.
+
+This message merely schedules the update process and returns immediately
+without blocking. The process then runs asynchronously.
+Since it can take several minutes to complete, during the update process
+notifications are emitted from the kernel to user space updating it about
+the status and progress.
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_FILE_NAME`` attribute encodes the firmware
+image file name. The firmware image is downloaded to the transceiver module,
+validated, run and committed.
+
+The optional ``ETHTOOL_A_MODULE_FW_FLASH_PASSWORD`` attribute encodes a password
+that might be required as part of the transceiver module firmware update
+process.
+
+The firmware update process can take several minutes to complete. Therefore,
+during the update process notifications are emitted from the kernel to user
+space updating it about the status and progress.
+
+
+
+Notification contents:
+
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_HEADER`` | nested | reply header |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` | u32 | status |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` | string | status message |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` | uint | progress |
+ +---------------------------------------------------+--------+----------------+
+ | ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL`` | uint | total |
+ +---------------------------------------------------+--------+----------------+
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS`` attribute encodes the current status
+of the firmware update process. Possible values are:
+
+.. kernel-doc:: include/uapi/linux/ethtool.h
+ :identifiers: ethtool_module_fw_flash_status
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_STATUS_MSG`` attribute encodes a status message
+string.
+
+The ``ETHTOOL_A_MODULE_FW_FLASH_DONE`` and ``ETHTOOL_A_MODULE_FW_FLASH_TOTAL``
+attributes encode the completed and total amount of work, respectively.
+
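As an illustration of how a listener might turn the ``DONE`` and ``TOTAL``
values into a progress figure, here is a minimal sketch. The function name is
hypothetical, and the choice to report "unknown" when the total is zero is an
assumption of this example, not something the interface mandates.

```c
#include <stdint.h>

/* Convert completed/total work into a percentage in the range 0..100.
 * Returns -1 when total is 0, i.e. when no progress can be derived. */
int fw_flash_progress_percent(uint64_t done, uint64_t total)
{
    if (total == 0)
        return -1;
    if (done >= total)
        return 100;
    /* Avoid overflowing done * 100 for very large counters. */
    if (done > UINT64_MAX / 100)
        return (int)(done / (total / 100));
    return (int)(done * 100 / total);
}
```
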
+PHY_GET
+=======
+
+Retrieve information about a given Ethernet PHY sitting on the link. The DO
+operation returns all available information about dev->phydev. The user can
+also specify a PHY_INDEX, in which case the DO request returns information
+about that specific PHY.
+
+As there can be more than one PHY, the DUMP operation can be used to list the PHYs
+present on a given interface, by passing an interface index or name in
+the dump request.
+
+For more information, refer to :ref:`phy_link_topology`.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_PHY_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ===================================== ====== ===============================
+ ``ETHTOOL_A_PHY_HEADER`` nested reply header
+ ``ETHTOOL_A_PHY_INDEX`` u32 the phy's unique index, that can
+ be used for phy-specific
+ requests
+ ``ETHTOOL_A_PHY_DRVNAME`` string the phy driver name
+ ``ETHTOOL_A_PHY_NAME`` string the phy device name
+ ``ETHTOOL_A_PHY_UPSTREAM_TYPE`` u32 the type of device this phy is
+ connected to
+ ``ETHTOOL_A_PHY_UPSTREAM_INDEX`` u32 the PHY index of the upstream
+ PHY
+ ``ETHTOOL_A_PHY_UPSTREAM_SFP_NAME`` string if this PHY is connected to
+ its parent PHY through an SFP
+ bus, the name of this sfp bus
+ ``ETHTOOL_A_PHY_DOWNSTREAM_SFP_NAME`` string if the phy controls an sfp bus,
+ the name of the sfp bus
+ ===================================== ====== ===============================
+
+When ``ETHTOOL_A_PHY_UPSTREAM_TYPE`` is PHY_UPSTREAM_PHY, the PHY's parent is
+another PHY.
+
+TSCONFIG_GET
+============
+
+Retrieves the information about the current hardware timestamping source and
+configuration.
+
+It is similar to the deprecated ``SIOCGHWTSTAMP`` ioctl request.
+
+Request contents:
+
+ ==================================== ====== ==========================
+ ``ETHTOOL_A_TSCONFIG_HEADER`` nested request header
+ ==================================== ====== ==========================
+
+Kernel response contents:
+
+ ======================================== ====== ============================
+ ``ETHTOOL_A_TSCONFIG_HEADER`` nested reply header
+ ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested PTP hw clock provider
+ ``ETHTOOL_A_TSCONFIG_TX_TYPES`` bitset hwtstamp Tx type
+ ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` bitset hwtstamp Rx filter
+ ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` u32 hwtstamp flags
+ ======================================== ====== ============================
+
+When set, the ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` attribute identifies the
+source of the hw timestamping provider. It is composed of the
+``ETHTOOL_A_TS_HWTSTAMP_PROVIDER_INDEX`` attribute, which describes the index of
+the PTP device, and ``ETHTOOL_A_TS_HWTSTAMP_PROVIDER_QUALIFIER``, which describes
+the qualifier of the timestamp.
+
+When set, the ``ETHTOOL_A_TSCONFIG_TX_TYPES``, ``ETHTOOL_A_TSCONFIG_RX_FILTERS``
+and the ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` attributes identify the Tx
+type, the Rx filter and the flags configured for the current hw timestamping
+provider. The attributes are propagated to the driver through the following
+structure:
+
+.. kernel-doc:: include/linux/net_tstamp.h
+ :identifiers: kernel_hwtstamp_config
+
+TSCONFIG_SET
+============
+
+Sets the current hardware timestamping source and
+configuration.
+
+It is similar to the deprecated ``SIOCSHWTSTAMP`` ioctl request.
+
+Request contents:
+
+ ======================================== ====== ============================
+ ``ETHTOOL_A_TSCONFIG_HEADER`` nested request header
+ ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested PTP hw clock provider
+ ``ETHTOOL_A_TSCONFIG_TX_TYPES`` bitset hwtstamp Tx type
+ ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` bitset hwtstamp Rx filter
+ ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` u32 hwtstamp flags
+ ======================================== ====== ============================
+
+Kernel response contents:
+
+ ======================================== ====== ============================
+ ``ETHTOOL_A_TSCONFIG_HEADER`` nested reply header
+ ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested PTP hw clock provider
+ ``ETHTOOL_A_TSCONFIG_TX_TYPES`` bitset hwtstamp Tx type
+ ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` bitset hwtstamp Rx filter
+ ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` u32 hwtstamp flags
+ ======================================== ====== ============================
+
+For a description of each attribute, see ``TSCONFIG_GET``.
+
Request translation
===================
@@ -2110,4 +2440,8 @@ are netlink only.
n/a ``ETHTOOL_MSG_PLCA_GET_STATUS``
n/a ``ETHTOOL_MSG_MM_GET``
n/a ``ETHTOOL_MSG_MM_SET``
+ n/a ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT``
+ n/a ``ETHTOOL_MSG_PHY_GET``
+ ``SIOCGHWTSTAMP`` ``ETHTOOL_MSG_TSCONFIG_GET``
+ ``SIOCSHWTSTAMP`` ``ETHTOOL_MSG_TSCONFIG_SET``
=================================== =====================================
diff --git a/Documentation/networking/filter.rst b/Documentation/networking/filter.rst
index 7d8c5380492f..8eb9a5d40f31 100644
--- a/Documentation/networking/filter.rst
+++ b/Documentation/networking/filter.rst
@@ -513,7 +513,7 @@ JIT compiler
------------
The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
-PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
+PowerPC, ARM, ARM64, MIPS, RISC-V, s390, and ARC and can be enabled through
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
attached filter from user space or for internal kernel users if it has
been previously enabled by root::
@@ -650,7 +650,7 @@ before a conversion to the new layout is being done behind the scenes!
Currently, the classic BPF format is being used for JITing on most
32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
-sparc64, arm32, riscv64, riscv32, loongarch64 perform JIT compilation
+sparc64, arm32, riscv64, riscv32, loongarch64 and arc perform JIT compilation
from eBPF instruction set.
Testing
diff --git a/Documentation/networking/ieee802154.rst b/Documentation/networking/ieee802154.rst
index c652d383fe10..743c0a80e309 100644
--- a/Documentation/networking/ieee802154.rst
+++ b/Documentation/networking/ieee802154.rst
@@ -72,7 +72,8 @@ exports a management (e.g. MLME) and data API.
possibly with some kinds of acceleration like automatic CRC computation and
comparison, automagic ACK handling, address matching, etc.
-Those types of devices require different approach to be hooked into Linux kernel.
+Each type of device requires a different approach to be hooked into the Linux
+kernel.
HardMAC
-------
@@ -81,10 +82,10 @@ See the header include/net/ieee802154_netdev.h. You have to implement Linux
net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family
code via plain sk_buffs. On skb reception skb->cb must contain additional
info as described in the struct ieee802154_mac_cb. During packet transmission
-the skb->cb is used to provide additional data to device's header_ops->create
-function. Be aware that this data can be overridden later (when socket code
-submits skb to qdisc), so if you need something from that cb later, you should
-store info in the skb->data on your own.
+the skb->cb is used to provide additional data to the device's
+header_ops->create function. Be aware that this data can be overridden later
+(when socket code submits skb to qdisc), so if you need something from that cb
+later, you should store info in the skb->data on your own.
To hook the MLME interface you have to populate the ml_priv field of your
net_device with a pointer to struct ieee802154_mlme_ops instance. The fields
@@ -94,8 +95,9 @@ All other fields are required.
SoftMAC
-------
-The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it
-provides interface for drivers registration and management of slave interfaces.
+The MAC is the middle layer in the IEEE 802.15.4 Linux stack. At the moment, it
+provides an interface for driver registration and management of slave
+interfaces.
NOTE: Currently only the monitor device type is supported; it is the IEEE 802.15.4
stack interface for network sniffers (e.g. Wireshark).
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 473d72c36d61..058193ed2eeb 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -14,11 +14,13 @@ Contents:
can
can_ucan_protocol
device_drivers/index
+ diagnostic/index
dsa/index
devlink/index
caif/index
ethtool-netlink
ieee802154
+ iso15765-2
j1939
kapi
msg_zerocopy
@@ -48,6 +50,7 @@ Contents:
cdc_mbim
dccp
dctcp
+ devmem
dns_resolver
driver
eql
@@ -72,6 +75,7 @@ Contents:
mac80211-injection
mctp
mpls-sysctl
+ mptcp
mptcp-sysctl
multiqueue
multi-pf-netdev
@@ -82,17 +86,21 @@ Contents:
netdevices
netfilter-sysctl
netif-msg
+ netmem
nexthop-group-resilient
nf_conntrack-sysctl
nf_flowtable
+ oa-tc6-framework
openvswitch
operstates
packet_mmap
phonet
+ phy-link-topology
pktgen
plip
ppp_generic
proc_net_tcp
+ pse-pd/index
radiotap-headers
rds
regulatory
@@ -103,6 +111,7 @@ Contents:
seg6-sysctl
skbuff
smc-sysctl
+ sriov
statistics
strparser
switchdev
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index bd50df6a5a42..363b4950d542 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -131,6 +131,20 @@ fib_multipath_hash_fields - UNSIGNED INTEGER
Default: 0x0007 (source IP, destination IP and IP protocol)
+fib_multipath_hash_seed - UNSIGNED INTEGER
+ The seed value used when calculating hash for multipath routes. Applies
+ to both IPv4 and IPv6 datapath. Only present for kernels built with
+ CONFIG_IP_ROUTE_MULTIPATH enabled.
+
+ When set to 0, the seed value used for multipath routing defaults to an
+ internally generated random one.
+
+ The actual hashing algorithm is not specified -- there is no guarantee
+ that a next hop distribution produced by a given seed will remain stable
+ across kernel versions.
+
+ Default: 0 (random)
+
fib_sync_mem - UNSIGNED INTEGER
Amount of dirty memory from fib entries that can be backlogged before
synchronize_rcu is forced.
@@ -986,6 +1000,20 @@ tcp_tw_reuse - INTEGER
Default: 2
+tcp_tw_reuse_delay - UNSIGNED INTEGER
+ The delay in milliseconds before a TIME-WAIT socket can be reused by a
+ new connection, if TIME-WAIT socket reuse is enabled. The actual reuse
+ threshold is within [N, N+1] range, where N is the requested delay in
+ milliseconds, to ensure the delay interval is never shorter than the
+ configured value.
+
+ This setting contains an assumption about the peer's TCP timestamp clock
+ tick interval. It should not be set to a value lower than the peer's
+ clock tick for the PAWS (Protection Against Wrapped Sequence numbers)
+ mechanism to work correctly for the reused connection.
+
+ Default: 1000 (milliseconds)
+
tcp_window_scaling - BOOLEAN
Enable window scaling as defined in RFC1323.
@@ -1196,6 +1224,19 @@ tcp_pingpong_thresh - INTEGER
Default: 1
+tcp_rto_min_us - INTEGER
+ Minimal TCP retransmission timeout (in microseconds). Note that the
+ rto_min route option has the highest precedence for configuring this
+ setting, followed by the TCP_BPF_RTO_MIN socket option, followed by
+ this tcp_rto_min_us sysctl.
+
+ The recommended practice is to use a value less than or equal to 200000
+ microseconds.
+
+ Possible Values: 1 - INT_MAX
+
+ Default: 200000
+
UDP variables
=============
@@ -2143,6 +2184,12 @@ nexthop_compat_mode - BOOLEAN
understands the new API, this sysctl can be disabled to achieve full
performance benefits of the new API by disabling the nexthop expansion
and extraneous notifications.
+
+ Note that as a backward-compatible mode, dumping of modern features
+ might be incomplete or wrong. For example, resilient groups will not be
+ shown as such, but rather as just a list of next hops. Also, weights that
+ do not fit into 8 bits will be shown incorrectly.
+
Default: true (backward compat mode)
fib_notify_on_flag_change - INTEGER
@@ -2335,6 +2382,20 @@ ra_honor_pio_life - BOOLEAN
Default: 0 (disabled)
+ra_honor_pio_pflag - BOOLEAN
+ The Prefix Information Option P-flag indicates the network can
+ allocate a unique IPv6 prefix per client using DHCPv6-PD.
+ This sysctl can be enabled when a userspace DHCPv6-PD client
+ is running to cause the P-flag to take effect: i.e. the
+ P-flag suppresses any effects of the A-flag within the same
+ PIO. For a given PIO, P=1 and A=1 is treated as A=0.
+
+ - If disabled, the P-flag is ignored.
+ - If enabled, the P-flag will disable SLAAC autoconfiguration
+ for the given Prefix Information Option.
+
+ Default: 0 (disabled)
+
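The interaction between the A-flag, the P-flag and this sysctl can be
summarised by the following minimal sketch. The function and parameter names
are hypothetical and only mirror the rule stated above; this is not the
kernel's implementation.

```c
#include <stdbool.h>

/* Return the A-flag value that is acted upon for one PIO.
 * honor_pflag mirrors the ra_honor_pio_pflag sysctl setting. */
bool pio_effective_a_flag(bool a_flag, bool p_flag, bool honor_pflag)
{
    /* With the sysctl enabled, P=1 suppresses the A-flag (A=1 becomes A=0). */
    if (honor_pflag && p_flag)
        return false;
    /* Otherwise the P-flag is ignored and the A-flag is used as received. */
    return a_flag;
}
```
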
accept_ra_rt_info_min_plen - INTEGER
Minimum prefix length of Route Information in RA.
diff --git a/Documentation/networking/iso15765-2.rst b/Documentation/networking/iso15765-2.rst
new file mode 100644
index 000000000000..37ebb2c417cb
--- /dev/null
+++ b/Documentation/networking/iso15765-2.rst
@@ -0,0 +1,386 @@
+.. SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+====================
+ISO 15765-2 (ISO-TP)
+====================
+
+Overview
+========
+
+ISO 15765-2, also known as ISO-TP, is a transport protocol specifically defined
+for diagnostic communication on CAN. It is widely used in the automotive
+industry, for example as the transport protocol for UDSonCAN (ISO 14229-3) or
+emission-related diagnostic services (ISO 15031-5).
+
+ISO-TP can be used both on CAN CC (aka Classical CAN) and CAN FD (CAN with
+Flexible Datarate) based networks. It is also designed to be compatible with a
+CAN network using SAE J1939 as data link layer (however, this is not a
+requirement).
+
+Specifications used
+-------------------
+
+* ISO 15765-2:2024 : Road vehicles - Diagnostic communication over Controller
+ Area Network (DoCAN). Part 2: Transport protocol and network layer services.
+
+Addressing
+----------
+
+In its simplest form, ISO-TP is based on two kinds of addressing modes for the
+nodes connected to the same network:
+
+* physical addressing is implemented by two node-specific addresses and is used
+ in 1-to-1 communication.
+
+* functional addressing is implemented by one node-specific address and is used
+ in 1-to-N communication.
+
+Three different addressing formats can be employed:
+
+* "normal" : each address is represented simply by a CAN ID.
+
+* "extended": each address is represented by a CAN ID plus the first byte of
+ the CAN payload; both the CAN ID and the byte inside the payload shall be
+ different between two addresses.
+
+* "mixed": each address is represented by a CAN ID plus the first byte of
+ the CAN payload; the CAN ID is different between two addresses, but the
+ additional byte is the same.
+
+Transport protocol and associated frame types
+---------------------------------------------
+
+When transmitting data using the ISO-TP protocol, the payload can either fit
+inside one single CAN message or not, also considering the overhead the protocol
+is generating and the optional extended addressing. In the first case, the data
+is transmitted at once using a so-called Single Frame (SF). In the second case,
+ISO-TP defines a multi-frame protocol, in which the sender provides (through a
+First Frame - FF) the PDU length which is to be transmitted and also asks for a
+Flow Control (FC) frame, which provides the maximum supported size of a macro
+data block (``blocksize``) and the minimum time between the single CAN messages
+composing such block (``stmin``). Once this information has been received, the
+sender starts to send frames containing fragments of the data payload (called
+Consecutive Frames - CF), stopping after every ``blocksize``-sized block to wait
+for confirmation from the receiver, which should then send another Flow Control
+frame to inform the sender about its availability to receive more data.
+
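To make the segmentation described above concrete, the following minimal
sketch counts the frames involved in one transfer. It assumes Classical CAN
(8-byte frames), a 12-bit length field in the First Frame (so at most 4095
payload bytes), and that a ``blocksize`` of 0 means the sender does not wait
for further Flow Control frames. The function names are hypothetical.

```c
#include <stddef.h>

/* Number of frames the sender emits (one SF, or one FF plus CFs) for len
 * payload bytes. ext_addr is non-zero when an addressing byte occupies the
 * first payload byte ("extended" or "mixed" formats).
 * Returns -1 if len is 0 or does not fit the 12-bit FF length field. */
long isotp_cc_frame_count(size_t len, int ext_addr)
{
    const size_t a = ext_addr ? 1 : 0;
    const size_t sf_max = 7 - a;  /* 8 bytes minus 1 PCI byte */
    const size_t ff_data = 6 - a; /* 8 bytes minus 2 PCI bytes */
    const size_t cf_data = 7 - a; /* 8 bytes minus 1 PCI byte */
    size_t rest;

    if (len == 0 || len > 4095)
        return -1;
    if (len <= sf_max)
        return 1;
    rest = len - ff_data;
    return 1 + (long)((rest + cf_data - 1) / cf_data);
}

/* Number of Flow Control frames the receiver sends for n_cf Consecutive
 * Frames: one after the FF, then one after each full block except the last. */
long isotp_fc_count(long n_cf, unsigned int blocksize)
{
    if (n_cf <= 0)
        return 0;
    if (blocksize == 0)
        return 1;
    return (n_cf - 1) / (long)blocksize + 1;
}
```

For example, a 20-byte payload with "normal" addressing needs one First Frame
carrying 6 bytes and two Consecutive Frames carrying the remaining 14 bytes.
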
+How to Use ISO-TP
+=================
+
+As with other CAN protocols, the ISO-TP stack support is built into the
+Linux network subsystem for the CAN bus, aka Linux-CAN or SocketCAN, and
+thus follows the same socket API.
+
+Creation and basic usage of an ISO-TP socket
+--------------------------------------------
+
+To use the ISO-TP stack, ``#include <linux/can/isotp.h>`` shall be used. A
+socket can then be created using the ``PF_CAN`` protocol family, the
+``SOCK_DGRAM`` type (as the underlying protocol is datagram-based by design)
+and the ``CAN_ISOTP`` protocol:
+
+.. code-block:: C
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
+
+After the socket has been successfully created, ``bind(2)`` shall be called to
+bind the socket to the desired CAN interface; to do so:
+
+* a TX CAN ID shall be specified as part of the sockaddr supplied to the call
+ itself.
+
+* an RX CAN ID shall also be specified, unless broadcast flags have been set
+ through socket options (explained below).
+
+Once bound to an interface, the socket can be read from and written to using
+the usual ``read(2)`` and ``write(2)`` system calls, as well as ``send(2)``,
+``sendmsg(2)``, ``recv(2)`` and ``recvmsg(2)``.
+Unlike the CAN_RAW socket API, only the ISO-TP data field (the actual payload)
+is sent and received by the userspace application using these calls. The address
+information and the protocol information are automatically filled by the ISO-TP
+stack using the configuration supplied during socket creation. In the same way,
+the stack will use the transport mechanism when required (i.e., when the size
+of the data payload exceeds the MTU of the underlying CAN bus).
+
+The sockaddr structure used for SocketCAN has extensions for use with ISO-TP,
+as specified below:
+
+.. code-block:: C
+
+ struct sockaddr_can {
+     sa_family_t can_family;
+     int         can_ifindex;
+     union {
+         struct { canid_t rx_id, tx_id; } tp;
+         ...
+     } can_addr;
+ };
+
+* ``can_family`` and ``can_ifindex`` serve the same purpose as for other
+ SocketCAN sockets.
+
+* ``can_addr.tp.rx_id`` specifies the receive (RX) CAN ID and will be used as
+ a RX filter.
+
+* ``can_addr.tp.tx_id`` specifies the transmit (TX) CAN ID.
+
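Putting the creation and bind steps together, a minimal sketch could look as
follows. The helper names ``isotp_fill_addr`` and ``isotp_open`` are
hypothetical, error handling is reduced to returning -1, and the interface
name and CAN IDs in the usage note are example values only.

```c
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/socket.h>
#include <linux/can.h>

/* Fill a sockaddr_can for ISO-TP with the given interface index and IDs. */
void isotp_fill_addr(struct sockaddr_can *addr, int ifindex,
                     canid_t rx_id, canid_t tx_id)
{
    memset(addr, 0, sizeof(*addr));
    addr->can_family = AF_CAN;
    addr->can_ifindex = ifindex;
    addr->can_addr.tp.rx_id = rx_id; /* used as RX filter */
    addr->can_addr.tp.tx_id = tx_id;
}

/* Create an ISO-TP socket and bind it to ifname.
 * Returns the socket descriptor, or -1 on any failure. */
int isotp_open(const char *ifname, canid_t rx_id, canid_t tx_id)
{
    struct sockaddr_can addr;
    unsigned int ifindex = if_nametoindex(ifname);
    int s;

    if (ifindex == 0)
        return -1;

    s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
    if (s < 0)
        return -1;

    isotp_fill_addr(&addr, (int)ifindex, rx_id, tx_id);
    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }
    return s;
}
```

A caller might use ``isotp_open("vcan0", 0x7E8, 0x7E0)`` and then exchange
payloads with ``write(2)`` and ``read(2)`` on the returned descriptor.
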
+ISO-TP socket options
+---------------------
+
+When creating an ISO-TP socket, reasonable defaults are set. Some options can
+be modified with ``setsockopt(2)`` and/or read back with ``getsockopt(2)``.
+
+General options
+~~~~~~~~~~~~~~~
+
+General socket options can be passed using the ``CAN_ISOTP_OPTS`` optname:
+
+.. code-block:: C
+
+ struct can_isotp_options opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_OPTS, &opts, sizeof(opts));
+
+where the ``can_isotp_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_options {
+ u32 flags;
+ u32 frame_txtime;
+ u8 ext_address;
+ u8 txpad_content;
+ u8 rxpad_content;
+ u8 rx_ext_address;
+ };
+
+* ``flags``: modifiers to be applied to the default behaviour of the ISO-TP
+ stack. The following flags are available:
+
+ * ``CAN_ISOTP_LISTEN_MODE``: listen only (do not send FC frames); normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_EXTEND_ADDR``: use the byte specified in ``ext_address`` as an
+ additional address component. This enables the "mixed" addressing format if
+ used alone, or the "extended" addressing format if used in conjunction with
+ ``CAN_ISOTP_RX_EXT_ADDR``.
+
+ * ``CAN_ISOTP_TX_PADDING``: enable padding for transmitted frames, using
+ ``txpad_content`` as value for the padding bytes.
+
+ * ``CAN_ISOTP_RX_PADDING``: enable padding for the received frames, using
+ ``rxpad_content`` as value for the padding bytes.
+
+ * ``CAN_ISOTP_CHK_PAD_LEN``: check for correct padding length on the received
+ frames.
+
+ * ``CAN_ISOTP_CHK_PAD_DATA``: check padding bytes on the received frames
+ against ``rxpad_content``; if ``CAN_ISOTP_RX_PADDING`` is not specified,
+ this flag is ignored.
+
+ * ``CAN_ISOTP_HALF_DUPLEX``: force the ISO-TP socket into half-duplex mode
+ (that is, the transport mechanism can be either incoming or outgoing at any
+ given time, not both).
+
+ * ``CAN_ISOTP_FORCE_TXSTMIN``: ignore stmin from received FC; normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_FORCE_RXSTMIN``: ignore CFs depending on rx stmin; normally
+ used as a testing feature.
+
+ * ``CAN_ISOTP_RX_EXT_ADDR``: use ``rx_ext_address`` instead of ``ext_address``
+ as extended addressing byte on the reception path. If used in conjunction
+ with ``CAN_ISOTP_EXTEND_ADDR``, this flag effectively enables the "extended"
+ addressing format.
+
+ * ``CAN_ISOTP_WAIT_TX_DONE``: wait until the frame is sent before returning
+ from ``write(2)`` and ``send(2)`` calls (i.e., blocking write operations).
+
+ * ``CAN_ISOTP_SF_BROADCAST``: use 1-to-N functional addressing (cannot be
+ specified alongside ``CAN_ISOTP_CF_BROADCAST``).
+
+ * ``CAN_ISOTP_CF_BROADCAST``: use 1-to-N transmission without flow control
+ (cannot be specified alongside ``CAN_ISOTP_SF_BROADCAST``).
+ NOTE: this is not covered by the ISO 15765-2 standard.
+
+ * ``CAN_ISOTP_DYN_FC_PARMS``: enable dynamic update of flow control
+ parameters.
+
+* ``frame_txtime``: frame transmission time (defined as N_As/N_Ar inside the
+ ISO standard); if ``0``, the default (or the last set value) is used.
+ To set the transmission time to ``0``, the ``CAN_ISOTP_FRAME_TXTIME_ZERO``
+ macro (equal to 0xFFFFFFFF) shall be used.
+
+* ``ext_address``: extended addressing byte, used if the
+ ``CAN_ISOTP_EXTEND_ADDR`` flag is specified.
+
+* ``txpad_content``: byte used as padding value for transmitted frames.
+
+* ``rxpad_content``: byte used as padding value for received frames.
+
+* ``rx_ext_address``: extended addressing byte for the reception path, used if
+ the ``CAN_ISOTP_RX_EXT_ADDR`` flag is specified.
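The ``frame_txtime`` rules above (``0`` keeps the current value, the ``CAN_ISOTP_FRAME_TXTIME_ZERO`` sentinel requests an actual zero) can be sketched as a small pure function. This is only an illustration of the documented semantics: ``ex_resolve_frame_txtime()`` and ``EX_FRAME_TXTIME_ZERO`` are hypothetical names, not part of the kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirror of CAN_ISOTP_FRAME_TXTIME_ZERO (0xFFFFFFFF per the text above). */
#define EX_FRAME_TXTIME_ZERO 0xFFFFFFFFU

/*
 * Hypothetical helper: given the value passed in frame_txtime and the value
 * currently in effect, return the value that would be in effect afterwards.
 */
uint32_t ex_resolve_frame_txtime(uint32_t requested, uint32_t current)
{
    if (requested == 0)
        return current;     /* 0: keep the default or the last set value */
    if (requested == EX_FRAME_TXTIME_ZERO)
        return 0;           /* sentinel: explicitly request a zero time */
    return requested;       /* any other value is taken as is */
}
```

The sentinel exists because ``0`` is already taken to mean "leave unchanged", so a real zero has to be spelled differently.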
+
+Flow Control options
+~~~~~~~~~~~~~~~~~~~~
+
+Flow Control (FC) options can be passed using the ``CAN_ISOTP_RECV_FC`` optname
+to provide the communication parameters for receiving ISO-TP PDUs.
+
+.. code-block:: C
+
+ struct can_isotp_fc_options fc_opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RECV_FC, &fc_opts, sizeof(fc_opts));
+
+where the ``can_isotp_fc_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_fc_options {
+ u8 bs;
+ u8 stmin;
+ u8 wftmax;
+ };
+
+* ``bs``: blocksize provided in flow control frames.
+
+* ``stmin``: minimum separation time provided in flow control frames; can
+ have the following values (others are reserved):
+
+ * 0x00 - 0x7F : 0 - 127 ms
+
+ * 0xF1 - 0xF9 : 100 us - 900 us
+
+* ``wftmax``: maximum number of wait frames provided in flow control frames.
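The ``stmin`` encoding listed above can be decoded with a few comparisons. The following is a minimal sketch; ``ex_stmin_to_us()`` is a hypothetical helper written for this illustration, and it reports every value outside the two listed ranges as reserved.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert an stmin byte into microseconds.
 * Returns -1 for encodings that the list above marks as reserved.
 */
int32_t ex_stmin_to_us(uint8_t stmin)
{
    if (stmin <= 0x7F)
        return (int32_t)stmin * 1000;           /* 0x00 - 0x7F: 0 - 127 ms */
    if (stmin >= 0xF1 && stmin <= 0xF9)
        return (int32_t)(stmin - 0xF0) * 100;   /* 0xF1 - 0xF9: 100 - 900 us */
    return -1;                                  /* reserved value */
}
```

For instance, ``0x0A`` decodes to 10000 us (10 ms) and ``0xF3`` to 300 us.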
+
+Link Layer options
+~~~~~~~~~~~~~~~~~~
+
+Link Layer (LL) options can be passed using the ``CAN_ISOTP_LL_OPTS`` optname:
+
+.. code-block:: C
+
+ struct can_isotp_ll_options ll_opts;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_LL_OPTS, &ll_opts, sizeof(ll_opts));
+
+where the ``can_isotp_ll_options`` structure has the following contents:
+
+.. code-block:: C
+
+ struct can_isotp_ll_options {
+ u8 mtu;
+ u8 tx_dl;
+ u8 tx_flags;
+ };
+
+* ``mtu``: generated and accepted CAN frame type, can be equal to ``CAN_MTU``
+ for classical CAN frames or ``CANFD_MTU`` for CAN FD frames.
+
+* ``tx_dl``: maximum payload length for transmitted frames, can have one value
+ among: 8, 12, 16, 20, 24, 32, 48, 64. Values above 8 only apply to CAN FD
+ traffic (i.e.: ``mtu = CANFD_MTU``).
+
+* ``tx_flags``: flags set into ``struct canfd_frame.flags`` at frame creation.
+ Only applies to CAN FD traffic (i.e.: ``mtu = CANFD_MTU``).
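An application may want to sanity-check link layer options before passing them to ``setsockopt()``. The sketch below encodes only the constraints stated above. ``ex_ll_opts_valid()`` is a hypothetical helper, and ``EX_CAN_MTU`` / ``EX_CANFD_MTU`` are local stand-ins assumed to equal ``CAN_MTU`` and ``CANFD_MTU`` (the sizes of ``struct can_frame`` and ``struct canfd_frame``); real code should take those from ``<linux/can.h>``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values of CAN_MTU and CANFD_MTU; use <linux/can.h> in real code. */
#define EX_CAN_MTU   16
#define EX_CANFD_MTU 72

/* Hypothetical helper: check an (mtu, tx_dl) pair against the rules above. */
bool ex_ll_opts_valid(uint8_t mtu, uint8_t tx_dl)
{
    static const uint8_t allowed[] = { 8, 12, 16, 20, 24, 32, 48, 64 };
    size_t i;

    if (mtu != EX_CAN_MTU && mtu != EX_CANFD_MTU)
        return false;               /* unknown frame type */
    if (mtu == EX_CAN_MTU && tx_dl > 8)
        return false;               /* lengths above 8 need CAN FD */

    for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
        if (allowed[i] == tx_dl)
            return true;
    return false;                   /* not one of the listed lengths */
}
```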
+
+Transmission stmin
+~~~~~~~~~~~~~~~~~~
+
+The transmission minimum separation time (stmin) can be forced using the
+``CAN_ISOTP_TX_STMIN`` optname and providing an stmin value in microseconds as
+a 32bit unsigned integer; this will overwrite the value sent by the receiver in
+flow control frames:
+
+.. code-block:: C
+
+ uint32_t stmin;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_TX_STMIN, &stmin, sizeof(stmin));
+
+Reception stmin
+~~~~~~~~~~~~~~~
+
+The reception minimum separation time (stmin) can be forced using the
+``CAN_ISOTP_RX_STMIN`` optname and providing an stmin value in microseconds as
+a 32bit unsigned integer; received Consecutive Frames (CF) whose timestamps
+differ by less than this value will be ignored:
+
+.. code-block:: C
+
+ uint32_t stmin;
+ ret = setsockopt(s, SOL_CAN_ISOTP, CAN_ISOTP_RX_STMIN, &stmin, sizeof(stmin));
+
+Multi-frame transport support
+-----------------------------
+
+The ISO-TP stack contained inside the Linux kernel supports the multi-frame
+transport mechanism defined by the standard, with the following constraints:
+
+* the maximum size of a PDU is defined by a module parameter, with a hard
+ limit imposed at build time.
+
+* when a transmission is in progress, subsequent calls to ``write(2)`` will
+ block, while calls to ``send(2)`` will either block or fail depending on the
+ presence of the ``MSG_DONTWAIT`` flag.
+
+* no support is present for sending "wait frames": whether a PDU can be fully
+ received or not is decided when the First Frame is received.
+
+Errors
+------
+
+The following errors are reported to userspace:
+
+RX path errors
+~~~~~~~~~~~~~~
+
+============ ===============================================================
+-ETIMEDOUT timeout of data reception
+-EILSEQ sequence number mismatch during a multi-frame reception
+-EBADMSG data reception with wrong padding
+============ ===============================================================
+
+TX path errors
+~~~~~~~~~~~~~~
+
+========== =================================================================
+-ECOMM flow control reception timeout
+-EMSGSIZE flow control reception overflow
+-EBADMSG flow control reception with wrong layout/padding
+========== =================================================================
+
+Examples
+========
+
+Basic node example
+------------------
+
+The following example implements a node using "normal" physical addressing, with
+an RX ID equal to 0x18DAF142 and a TX ID equal to 0x18DA42F1. All options are
+left at their defaults.
+
+.. code-block:: C
+
+ int s;
+ struct sockaddr_can addr;
+ int ret;
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_ISOTP);
+ if (s < 0)
+ exit(1);
+
+ addr.can_family = AF_CAN;
+ addr.can_ifindex = if_nametoindex("can0");
+ addr.can_addr.tp.tx_id = 0x18DA42F1 | CAN_EFF_FLAG;
+ addr.can_addr.tp.rx_id = 0x18DAF142 | CAN_EFF_FLAG;
+
+ ret = bind(s, (struct sockaddr *)&addr, sizeof(addr));
+ if (ret < 0)
+ exit(1);
+
+ /* Data can now be received using read(s, ...) and sent using write(s, ...) */
+
+Additional examples
+-------------------
+
+More complete (and complex) examples can be found inside the ``isotp*`` userland
+tools, distributed as part of the ``can-utils`` utilities at:
+https://github.com/linux-can/can-utils
diff --git a/Documentation/networking/j1939.rst b/Documentation/networking/j1939.rst
index e4bd7aa1f5aa..544bad175aae 100644
--- a/Documentation/networking/j1939.rst
+++ b/Documentation/networking/j1939.rst
@@ -121,7 +121,7 @@ format, the Group Extension is set in the PS-field.
On the other hand, when using PDU1 format, the PS-field contains a so-called
Destination Address, which is _not_ part of the PGN. When communicating a PGN
-from user space to kernel (or vice versa) and PDU2 format is used, the PS-field
+from user space to kernel (or vice versa) and PDU1 format is used, the PS-field
of the PGN shall be set to zero. The Destination Address shall be set
elsewhere.
diff --git a/Documentation/networking/kapi.rst b/Documentation/networking/kapi.rst
index ea55f462cefa..98682b9a13ee 100644
--- a/Documentation/networking/kapi.rst
+++ b/Documentation/networking/kapi.rst
@@ -104,6 +104,9 @@ Driver Support
.. kernel-doc:: include/linux/netdevice.h
:internal:
+.. kernel-doc:: include/net/net_shaper.h
+ :internal:
+
PHY Support
-----------
diff --git a/Documentation/networking/l2tp.rst b/Documentation/networking/l2tp.rst
index 8496b467dea4..e8cf8b3e60ac 100644
--- a/Documentation/networking/l2tp.rst
+++ b/Documentation/networking/l2tp.rst
@@ -638,9 +638,8 @@ Tunnels are identified by a unique tunnel id. The id is 16-bit for
L2TPv2 and 32-bit for L2TPv3. Internally, the id is stored as a 32-bit
value.
-Tunnels are kept in a per-net list, indexed by tunnel id. The tunnel
-id namespace is shared by L2TPv2 and L2TPv3. The tunnel context can be
-derived from the socket's sk_user_data.
+Tunnels are kept in a per-net list, indexed by tunnel id. The
+tunnel id namespace is shared by L2TPv2 and L2TPv3.
Handling tunnel socket close is perhaps the most tricky part of the
L2TP implementation. If userspace closes a tunnel socket, the L2TP
@@ -652,9 +651,7 @@ socket's encap_destroy handler is invoked, which L2TP uses to initiate
its tunnel close actions. For L2TPIP sockets, the socket's close
handler initiates the same tunnel close actions. All sessions are
first closed. Each session drops its tunnel ref. When the tunnel ref
-reaches zero, the tunnel puts its socket ref. When the socket is
-eventually destroyed, its sk_destruct finally frees the L2TP tunnel
-context.
+reaches zero, the tunnel drops its socket ref.
Sessions
--------
@@ -667,10 +664,7 @@ pseudowire) or other data types such as PPP, ATM, HDLC or Frame
Relay. Linux currently implements only Ethernet and PPP session types.
Some L2TP session types also have a socket (PPP pseudowires) while
-others do not (Ethernet pseudowires). We can't therefore use the
-socket reference count as the reference count for session
-contexts. The L2TP implementation therefore has its own internal
-reference counts on the session contexts.
+others do not (Ethernet pseudowires).
Like tunnels, L2TP sessions are identified by a unique
session id. Just as with tunnel ids, the session id is 16-bit for
@@ -680,21 +674,19 @@ value.
Sessions hold a ref on their parent tunnel to ensure that the tunnel
stays extant while one or more sessions references it.
-Sessions are kept in a per-tunnel list, indexed by session id. L2TPv3
-sessions are also kept in a per-net list indexed by session id,
-because L2TPv3 session ids are unique across all tunnels and L2TPv3
-data packets do not contain a tunnel id in the header. This list is
-therefore needed to find the session context associated with a
-received data packet when the tunnel context cannot be derived from
-the tunnel socket.
+Sessions are kept in a per-net list. L2TPv2 sessions and L2TPv3
+sessions are stored in separate lists. L2TPv2 sessions are keyed
+by a 32-bit key made up of the 16-bit tunnel ID and 16-bit
+session ID. L2TPv3 sessions are keyed by the 32-bit session ID, since
+L2TPv3 session ids are unique across all tunnels.
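As an illustration of the L2TPv2 key described above, a 32-bit key can be composed from the two 16-bit IDs as follows. This is a sketch under an assumption: the text does not say which ID occupies which half, so placing the tunnel ID in the upper 16 bits is a guess, and ``ex_l2tp_v2_session_key()`` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a 32-bit lookup key for an L2TPv2 session.
 * Assumed layout: tunnel ID in bits 31..16, session ID in bits 15..0.
 */
uint32_t ex_l2tp_v2_session_key(uint16_t tunnel_id, uint16_t session_id)
{
    return ((uint32_t)tunnel_id << 16) | session_id;
}
```

Because both IDs are part of the key, two L2TPv2 sessions with the same session ID in different tunnels map to different keys.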
Although the L2TPv3 RFC specifies that L2TPv3 session ids are not
-scoped by the tunnel, the kernel does not police this for L2TPv3 UDP
-tunnels and does not add sessions of L2TPv3 UDP tunnels into the
-per-net session list. In the UDP receive code, we must trust that the
-tunnel can be identified using the tunnel socket's sk_user_data and
-lookup the session in the tunnel's session list instead of the per-net
-session list.
+scoped by the tunnel, the Linux implementation has historically
+allowed this. Such session id collisions are supported using a per-net
+hash table keyed by sk and session ID. When looking up L2TPv3
+sessions, the list entry may link to multiple sessions with that
+session ID, in which case the session matching the given sk (tunnel)
+is used.
PPP
---
@@ -714,10 +706,9 @@ The L2TP PPP implementation handles the closing of a PPPoL2TP socket
by closing its corresponding L2TP session. This is complicated because
it must consider racing with netlink session create/destroy requests
and pppol2tp_connect trying to reconnect with a session that is in the
-process of being closed. Unlike tunnels, PPP sessions do not hold a
-ref on their associated socket, so code must be careful to sock_hold
-the socket where necessary. For all the details, see commit
-3d609342cc04129ff7568e19316ce3d7451a27e8.
+process of being closed. PPP sessions hold a ref on their associated
+socket so that the socket remains extant while the session
+references it.
Ethernet
--------
@@ -761,15 +752,10 @@ Limitations
The current implementation has a number of limitations:
- 1) Multiple UDP sockets with the same 5-tuple address cannot be
- used. The kernel's tunnel context is identified using private
- data associated with the socket so it is important that each
- socket is uniquely identified by its address.
-
- 2) Interfacing with openvswitch is not yet implemented. It may be
+ 1) Interfacing with openvswitch is not yet implemented. It may be
useful to map OVS Ethernet and VLAN ports into L2TPv3 tunnels.
- 3) VLAN pseudowires are implemented using an ``l2tpethN`` interface
+ 2) VLAN pseudowires are implemented using an ``l2tpethN`` interface
configured with a VLAN sub-interface. Since L2TPv3 VLAN
pseudowires carry one and only one VLAN, it may be better to use
a single netdevice rather than an ``l2tpethN`` and ``l2tpethN``:M
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst
index 69975ce25a02..03e1d3610333 100644
--- a/Documentation/networking/mptcp-sysctl.rst
+++ b/Documentation/networking/mptcp-sysctl.rst
@@ -7,14 +7,6 @@ MPTCP Sysfs variables
/proc/sys/net/mptcp/* Variables
===============================
-enabled - BOOLEAN
- Control whether MPTCP sockets can be created.
-
- MPTCP sockets can be created if the value is 1. This is a
- per-namespace sysctl.
-
- Default: 1 (enabled)
-
add_addr_timeout - INTEGER (seconds)
Set the timeout after which an ADD_ADDR control message will be
resent to an MPTCP peer that has not acknowledged a previous
@@ -25,16 +17,33 @@ add_addr_timeout - INTEGER (seconds)
Default: 120
-close_timeout - INTEGER (seconds)
- Set the make-after-break timeout: in absence of any close or
- shutdown syscall, MPTCP sockets will maintain the status
- unchanged for such time, after the last subflow removal, before
- moving to TCP_CLOSE.
+allow_join_initial_addr_port - BOOLEAN
+ Allow peers to send join requests to the IP address and port number used
+ by the initial subflow if the value is 1. This controls a flag that is
+ sent to the peer at connection time, and whether such join requests are
+ accepted or denied.
- The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
- sysctl.
+ Joins to addresses advertised with ADD_ADDR are not affected by this
+ value.
- Default: 60
+ This is a per-namespace sysctl.
+
+ Default: 1
+
+available_schedulers - STRING
+ Shows the available scheduler choices that are registered. More packet
+ schedulers may be available, but not loaded.
+
+blackhole_timeout - INTEGER (seconds)
+ Initial time period in seconds to disable MPTCP on active MPTCP sockets
+ when an MPTCP firewall blackhole issue happens. This time period will
+ grow exponentially when more blackhole issues get detected right after
+ MPTCP is re-enabled and will reset to the initial value when the
+ blackhole issue goes away.
+
+ 0 to disable the blackhole detection. This is a per-namespace sysctl.
+
+ Default: 3600
checksum_enabled - BOOLEAN
Control whether DSS checksum can be enabled.
@@ -44,18 +53,24 @@ checksum_enabled - BOOLEAN
Default: 0
-allow_join_initial_addr_port - BOOLEAN
- Allow peers to send join requests to the IP address and port number used
- by the initial subflow if the value is 1. This controls a flag that is
- sent to the peer at connection time, and whether such join requests are
- accepted or denied.
+close_timeout - INTEGER (seconds)
+ Set the make-after-break timeout: in absence of any close or
+ shutdown syscall, MPTCP sockets will maintain the status
+ unchanged for such time, after the last subflow removal, before
+ moving to TCP_CLOSE.
- Joins to addresses advertised with ADD_ADDR are not affected by this
- value.
+ The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
+ sysctl.
- This is a per-namespace sysctl.
+ Default: 60
- Default: 1
+enabled - BOOLEAN
+ Control whether MPTCP sockets can be created.
+
+ MPTCP sockets can be created if the value is 1. This is a
+ per-namespace sysctl.
+
+ Default: 1 (enabled)
pm_type - INTEGER
Set the default path manager type to use for each new MPTCP
@@ -74,6 +89,14 @@ pm_type - INTEGER
Default: 0
+scheduler - STRING
+ Select the scheduler of your choice.
+
+ Support for selection of different schedulers. This is a per-namespace
+ sysctl.
+
+ Default: "default"
+
stale_loss_cnt - INTEGER
The number of MPTCP-level retransmission intervals with no traffic and
pending outstanding data on a given subflow required to declare it stale.
@@ -86,10 +109,18 @@ stale_loss_cnt - INTEGER
Default: 4
-scheduler - STRING
- Select the scheduler of your choice.
+syn_retrans_before_tcp_fallback - INTEGER
+ The number of SYN + MP_CAPABLE retransmissions before falling back to
+ TCP, i.e. dropping the MPTCP options. In other words, if all the packets
+ are dropped on the way, there will be:
- Support for selection of different schedulers. This is a per-namespace
- sysctl.
+ * The initial SYN with MPTCP support
+ * This number of SYN retransmitted with MPTCP support
+ * The next SYN retransmissions will be without MPTCP support
- Default: "default"
+ 0 means the first retransmission will be done without MPTCP options.
+ >= 128 means that all SYN retransmissions will keep the MPTCP options. A
+ lower number might increase false-positive MPTCP blackhole detections.
+ This is a per-namespace sysctl.
+
+ Default: 2
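The rule described for ``syn_retrans_before_tcp_fallback`` can be restated as a predicate over the retransmission index. The sketch below only models the behavior as documented; ``ex_syn_has_mptcp()`` is a hypothetical function, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: does a given SYN still carry the MPTCP options?
 * limit   - value of syn_retrans_before_tcp_fallback
 * retrans - 0 for the initial SYN, 1 for the first retransmission, ...
 */
bool ex_syn_has_mptcp(unsigned int limit, unsigned int retrans)
{
    if (limit >= 128)
        return true;            /* all retransmissions keep the options */
    return retrans <= limit;    /* initial SYN plus 'limit' retransmissions */
}
```

With the default of 2, the initial SYN and the first two retransmissions carry MPTCP options, and the third retransmission is sent as plain TCP.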
diff --git a/Documentation/networking/mptcp.rst b/Documentation/networking/mptcp.rst
new file mode 100644
index 000000000000..17f2bab61164
--- /dev/null
+++ b/Documentation/networking/mptcp.rst
@@ -0,0 +1,156 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multipath TCP (MPTCP)
+=====================
+
+Introduction
+============
+
+Multipath TCP or MPTCP is an extension to the standard TCP and is described in
+`RFC 8684 (MPTCPv1) <https://www.rfc-editor.org/rfc/rfc8684.html>`_. It allows a
+device to make use of multiple interfaces at once to send and receive TCP
+packets over a single MPTCP connection. MPTCP can aggregate the bandwidth of
+multiple interfaces or prefer the one with the lowest latency. It also allows a
+fail-over if one path is down, and the traffic is seamlessly reinjected on other
+paths.
+
+For more details about Multipath TCP in the Linux kernel, please see the
+official website: `mptcp.dev <https://www.mptcp.dev>`_.
+
+
+Use cases
+=========
+
+Thanks to MPTCP, being able to use multiple paths in parallel or simultaneously
+brings new use-cases, compared to TCP:
+
+- Seamless handovers: switching from one path to another while preserving
+ established connections, e.g. to be used in mobility use-cases, like on
+ smartphones.
+- Best network selection: using the "best" available path depending on some
+ conditions, e.g. latency, losses, cost, bandwidth, etc.
+- Network aggregation: using multiple paths at the same time to have a higher
+ throughput, e.g. to combine fixed and mobile networks to send files faster.
+
+
+Concepts
+========
+
+Technically, when a new socket is created with the ``IPPROTO_MPTCP`` protocol
+(Linux-specific), a *subflow* (or *path*) is created. This *subflow* consists of
+a regular TCP connection that is used to transmit data through one interface.
+Additional *subflows* can be negotiated later between the hosts. For the remote
+host to be able to detect the use of MPTCP, a new field is added to the TCP
+*option* field of the underlying TCP *subflow*. This field contains, amongst
+other things, a ``MP_CAPABLE`` option that tells the other host to use MPTCP if
+it is supported. If the remote host or any middlebox in between does not support
+it, the returned ``SYN+ACK`` packet will not contain MPTCP options in the TCP
+*option* field. In that case, the connection will be "downgraded" to plain TCP,
+and it will continue with a single path.
+
+This behavior is made possible by two internal components: the path manager, and
+the packet scheduler.
+
+Path Manager
+------------
+
+The Path Manager is in charge of *subflows*, from creation to deletion, and also
+address announcements. Typically, it is the client side that initiates subflows,
+and the server side that announces additional addresses via the ``ADD_ADDR`` and
+``REMOVE_ADDR`` options.
+
+Path managers are controlled by the ``net.mptcp.pm_type`` sysctl knob -- see
+mptcp-sysctl.rst. There are two types: the in-kernel one (type ``0``) where the
+same rules are applied for all the connections (see: ``ip mptcp``); and the
+userspace one (type ``1``), controlled by a userspace daemon (e.g. `mptcpd
+<https://mptcpd.mptcp.dev/>`_) where different rules can be applied for each
+connection. The path managers can be controlled via a Netlink API; see
+netlink_spec/mptcp_pm.rst.
+
+To be able to use multiple IP addresses on a host to create multiple *subflows*
+(paths), the default in-kernel MPTCP path-manager needs to know which IP
+addresses can be used. This can be configured with ``ip mptcp endpoint`` for
+example.
+
+Packet Scheduler
+----------------
+
+The Packet Scheduler is in charge of selecting which available *subflow(s)* to
+use to send the next data packet. It can decide to maximize the use of the
+available bandwidth, to pick only the path with the lowest latency, or to apply
+any other policy depending on the configuration.
+
+Packet schedulers are controlled by the ``net.mptcp.scheduler`` sysctl knob --
+see mptcp-sysctl.rst.
+
+
+Sockets API
+===========
+
+Creating MPTCP sockets
+----------------------
+
+On Linux, MPTCP can be used by selecting MPTCP instead of TCP when creating the
+``socket``:
+
+.. code-block:: C
+
+ int sd = socket(AF_INET(6), SOCK_STREAM, IPPROTO_MPTCP);
+
+Note that ``IPPROTO_MPTCP`` is defined as ``262``.
+
+If MPTCP is not supported, ``errno`` will be set to:
+
+- ``EINVAL`` (*Invalid argument*): MPTCP is not available, on kernels < v5.6.
+- ``EPROTONOSUPPORT`` (*Protocol not supported*): MPTCP has not been compiled,
+ on kernels >= v5.6.
+- ``ENOPROTOOPT`` (*Protocol not available*): MPTCP has been disabled using
+ ``net.mptcp.enabled`` sysctl knob; see mptcp-sysctl.rst.
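An application that prefers MPTCP but must keep working where it is unavailable can use the three ``errno`` values above to fall back to plain TCP. The following is a minimal sketch of that pattern; ``ex_open_stream_socket()`` is a hypothetical helper, and the fallback policy is an assumption of this example rather than something mandated by the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262   /* value stated above; older libcs may lack it */
#endif

/*
 * Hypothetical helper: try MPTCP first, then fall back to TCP when the
 * failure indicates that MPTCP itself is unavailable.
 * Returns a socket descriptor, or -1 with errno set.
 */
int ex_open_stream_socket(int family)
{
    int fd;

    assert(family == AF_INET || family == AF_INET6);

    fd = socket(family, SOCK_STREAM, IPPROTO_MPTCP);
    if (fd >= 0)
        return fd;

    /* Only these errors mean "no MPTCP here"; others are real failures. */
    if (errno == EINVAL || errno == EPROTONOSUPPORT || errno == ENOPROTOOPT)
        return socket(family, SOCK_STREAM, IPPROTO_TCP);

    return -1;
}
```

The caller can use the returned descriptor exactly like a TCP socket in both cases.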
+
+MPTCP is then opt-in: applications need to explicitly request it. Note that
+applications can be forced to use MPTCP with different techniques, e.g.
+``LD_PRELOAD`` (see ``mptcpize``), eBPF (see ``mptcpify``), SystemTAP,
+``GODEBUG`` (``GODEBUG=multipathtcp=1``), etc.
+
+Switching to ``IPPROTO_MPTCP`` instead of ``IPPROTO_TCP`` should be as
+transparent as possible for the userspace applications.
+
+Socket options
+--------------
+
+MPTCP supports most socket options handled by TCP. It is possible some less
+common options are not supported, but contributions are welcome.
+
+Generally, the same value is propagated to all subflows, including the ones
+created after the calls to ``setsockopt()``. eBPF can be used to set different
+values per subflow.
+
+There are some MPTCP specific socket options at the ``SOL_MPTCP`` (284) level to
+retrieve info. They fill the ``optval`` buffer of the ``getsockopt()`` system
+call:
+
+- ``MPTCP_INFO``: Uses ``struct mptcp_info``.
+- ``MPTCP_TCPINFO``: Uses ``struct mptcp_subflow_data``, followed by an array of
+ ``struct tcp_info``.
+- ``MPTCP_SUBFLOW_ADDRS``: Uses ``struct mptcp_subflow_data``, followed by an
+ array of ``mptcp_subflow_addrs``.
+- ``MPTCP_FULL_INFO``: Uses ``struct mptcp_full_info``, with one pointer to an
+ array of ``struct mptcp_subflow_info`` (including the
+ ``struct mptcp_subflow_addrs``), and one pointer to an array of
+ ``struct tcp_info``, followed by the content of ``struct mptcp_info``.
+
+Note that at the TCP level, the ``TCP_IS_MPTCP`` socket option can be used to know
+if MPTCP is currently being used: the value will be set to 1 if it is.
+
+
+Design choices
+==============
+
+A new socket type has been added for MPTCP for the userspace-facing socket. The
+kernel is in charge of creating subflow sockets: they are TCP sockets where the
+behavior is modified using TCP-ULP.
+
+MPTCP listen sockets will create "plain" *accepted* TCP sockets if the
+connection request from the client didn't ask for MPTCP, making the performance
+impact minimal when MPTCP is enabled by default.
diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst
index 268819225866..2f5a5bb3ca9a 100644
--- a/Documentation/networking/multi-pf-netdev.rst
+++ b/Documentation/networking/multi-pf-netdev.rst
@@ -89,7 +89,7 @@ Observability
=============
The relation between PF, irq, napi, and queue can be observed via netlink spec::
- $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
+ $ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
{'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
{'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
@@ -101,7 +101,7 @@ The relation between PF, irq, napi, and queue can be observed via netlink spec::
{'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
{'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]
- $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
+ $ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
[{'id': 543, 'ifindex': 13, 'irq': 42},
{'id': 542, 'ifindex': 13, 'irq': 41},
{'id': 541, 'ifindex': 13, 'irq': 40},
@@ -111,11 +111,11 @@ The relation between PF, irq, napi, and queue can be observed via netlink spec::
Here you can clearly observe our channels distribution policy::
$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
- /proc/irq/36/mlx5_comp1@pci:0000:08:00.0
- /proc/irq/39/mlx5_comp1@pci:0000:09:00.0
- /proc/irq/40/mlx5_comp2@pci:0000:08:00.0
- /proc/irq/41/mlx5_comp2@pci:0000:09:00.0
- /proc/irq/42/mlx5_comp3@pci:0000:08:00.0
+ /proc/irq/36/mlx5_comp0@pci:0000:08:00.0
+ /proc/irq/39/mlx5_comp0@pci:0000:09:00.0
+ /proc/irq/40/mlx5_comp1@pci:0000:08:00.0
+ /proc/irq/41/mlx5_comp1@pci:0000:09:00.0
+ /proc/irq/42/mlx5_comp2@pci:0000:08:00.0
Steering
========
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
index 7bf7b95c4f7a..f970a2be271a 100644
--- a/Documentation/networking/napi.rst
+++ b/Documentation/networking/napi.rst
@@ -144,9 +144,8 @@ IRQ should only be unmasked after a successful call to napi_complete_done():
napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
of guarantees given by being invoked in IRQ context (no need to
-mask interrupts). Note that PREEMPT_RT forces all interrupts
-to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD``
-to avoid issues on real-time kernel configurations.
+mask interrupts). napi_schedule_irqoff() will fall back to napi_schedule() if
+IRQs are threaded (such as if ``PREEMPT_RT`` is enabled).
Instance to queue mapping
-------------------------
@@ -193,6 +192,33 @@ is reused to control the delay of the timer, while
``napi_defer_hard_irqs`` controls the number of consecutive empty polls
before NAPI gives up and goes back to using hardware IRQs.
+The above parameters can also be set on a per-NAPI basis using netlink via
+netdev-genl. When used with netlink and configured on a per-NAPI basis, the
+parameters mentioned above use hyphens instead of underscores:
+``gro-flush-timeout`` and ``napi-defer-hard-irqs``.
+
+Per-NAPI configuration can be done programmatically in a user application
+or by using a script included in the kernel source tree:
+``tools/net/ynl/pyynl/cli.py``.
+
+For example, using the script:
+
+.. code-block:: bash
+
+ $ kernel-source/tools/net/ynl/pyynl/cli.py \
+ --spec Documentation/netlink/specs/netdev.yaml \
+ --do napi-set \
+ --json='{"id": 345,
+ "defer-hard-irqs": 111,
+ "gro-flush-timeout": 11111}'
+
+Similarly, the parameter ``irq-suspend-timeout`` can be set using netlink
+via netdev-genl. There is no global sysfs parameter for this value.
+
+``irq-suspend-timeout`` is used to determine how long an application can
+completely suspend IRQs. It is used in combination with SO_PREFER_BUSY_POLL,
+which can be set on a per-epoll context basis with the ``EPIOCSPARAMS`` ioctl.
+
.. _poll:
Busy polling
@@ -208,6 +234,46 @@ selected sockets or using the global ``net.core.busy_poll`` and
``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
also exists.
+epoll-based busy polling
+------------------------
+
+It is possible to trigger packet processing directly from calls to
+``epoll_wait``. In order to use this feature, a user application must ensure
+all file descriptors which are added to an epoll context have the same NAPI ID.
+
+If the application uses a dedicated acceptor thread, the application can obtain
+the NAPI ID of the incoming connection using SO_INCOMING_NAPI_ID and then
+distribute that file descriptor to a worker thread. The worker thread would add
+the file descriptor to its epoll context. This would ensure each worker thread
+has an epoll context with FDs that have the same NAPI ID.
+
+Alternatively, if the application uses SO_REUSEPORT, a bpf or ebpf program can
+be inserted to distribute incoming connections to threads such that each thread
+is only given incoming connections with the same NAPI ID. Care must be taken to
+handle cases where a system may have multiple NICs.
+
+In order to enable busy polling, there are two choices:
+
+1. ``/proc/sys/net/core/busy_poll`` can be set with a time in microseconds to busy
+ loop waiting for events. This is a system-wide setting and will cause all
+ epoll-based applications to busy poll when they call epoll_wait. This may
+ not be desirable as many applications may not have the need to busy poll.
+
+2. Applications using recent kernels can issue an ioctl on the epoll context
+ file descriptor to set (``EPIOCSPARAMS``) or get (``EPIOCGPARAMS``) ``struct
+ epoll_params``, which user programs can define as follows:
+
+.. code-block:: c
+
+ struct epoll_params {
+ uint32_t busy_poll_usecs;
+ uint16_t busy_poll_budget;
+ uint8_t prefer_busy_poll;
+
+ /* pad the struct to a multiple of 64bits */
+ uint8_t __pad;
+ };
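A sketch of how an application might fill this structure and issue the ioctl is shown below. To avoid clashing with headers that already provide these definitions, the example uses its own names: ``struct ex_epoll_params``, ``EX_EPIOCSPARAMS`` and ``ex_set_busy_poll()`` are hypothetical, and the ioctl encoding (type ``0x8A``, number ``0x01``) is an assumption about ``<linux/eventpoll.h>`` that should be checked against the installed headers; real code should use ``EPIOCSPARAMS`` from there.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local mirror of struct epoll_params, same field order and sizes. */
struct ex_epoll_params {
    uint32_t busy_poll_usecs;
    uint16_t busy_poll_budget;
    uint8_t prefer_busy_poll;
    uint8_t pad;                /* padding to a multiple of 64 bits */
};

/* Assumed to match EPIOCSPARAMS; verify against <linux/eventpoll.h>. */
#define EX_EPIOCSPARAMS _IOW(0x8A, 0x01, struct ex_epoll_params)

/*
 * Hypothetical helper: configure busy polling on one epoll context.
 * Returns the ioctl() result: 0 on success, -1 with errno set on failure.
 */
int ex_set_busy_poll(int epfd, uint32_t usecs, uint16_t budget, int prefer)
{
    struct ex_epoll_params p;

    assert(sizeof(p) == 8);     /* layout check: one 64-bit word */

    memset(&p, 0, sizeof(p));   /* keeps the padding byte zeroed */
    p.busy_poll_usecs = usecs;
    p.busy_poll_budget = budget;
    p.prefer_busy_poll = prefer ? 1 : 0;

    return ioctl(epfd, EX_EPIOCSPARAMS, &p);
}
```

Kernels without this ioctl are expected to fail the call, so callers should treat an error as "feature not available" and continue without per-context busy polling.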
+
IRQ mitigation
---------------
@@ -223,12 +289,111 @@ Such applications can pledge to the kernel that they will perform a busy
polling operation periodically, and the driver should keep the device IRQs
permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
socket option. To avoid system misbehavior the pledge is revoked
-if ``gro_flush_timeout`` passes without any busy poll call.
+if ``gro_flush_timeout`` passes without any busy poll call. For epoll-based
+busy polling applications, the ``prefer_busy_poll`` field of ``struct
+epoll_params`` can be set to 1 and the ``EPIOCSPARAMS`` ioctl can be issued to
+enable this mode. See the above section for more details.
The NAPI budget for busy polling is lower than the default (which makes
sense given the low latency intention of normal busy polling). This is
not the case with IRQ mitigation, however, so the budget can be adjusted
-with the ``SO_BUSY_POLL_BUDGET`` socket option.
+with the ``SO_BUSY_POLL_BUDGET`` socket option. For epoll-based busy polling
+applications, the ``busy_poll_budget`` field can be adjusted to the desired value
+in ``struct epoll_params`` and set on a specific epoll context using the ``EPIOCSPARAMS``
+ioctl. See the above section for more details.
+
+It is important to note that choosing a large value for ``gro_flush_timeout``
+will defer IRQs to allow for better batch processing, but will induce latency
+when the system is not fully loaded. Choosing a small value for
+``gro_flush_timeout`` can cause device IRQs and softirq processing to
+interfere with the user application which is attempting to busy poll. This
+value should be chosen carefully with these tradeoffs in mind. epoll-based busy
+polling applications may be able to mitigate how much user processing happens
+by choosing an appropriate value for ``maxevents``.
+
+Users may want to consider an alternate approach, IRQ suspension, to help deal
+with these tradeoffs.
+
+IRQ suspension
+--------------
+
+IRQ suspension is a mechanism wherein device IRQs are masked while epoll
+triggers NAPI packet processing.
+
+While application calls to epoll_wait successfully retrieve events, the kernel will
+defer the IRQ suspension timer. If the kernel does not retrieve any events
+while busy polling (for example, because network traffic levels subsided), IRQ
+suspension is disabled and the IRQ mitigation strategies described above are
+engaged.
+
+This allows users to balance CPU consumption with network processing
+efficiency.
+
+To use this mechanism:
+
+ 1. The per-NAPI config parameter ``irq-suspend-timeout`` should be set to the
+ maximum time (in nanoseconds) the application can have its IRQs
+ suspended. This is done using netlink, as described above. This timeout
+ serves as a safety mechanism to restart IRQ driver interrupt processing if
+ the application has stalled. This value should be chosen so that it covers
+ the amount of time the user application needs to process data from its
+ call to epoll_wait, noting that applications can control how much data
+ they retrieve by setting ``maxevents`` when calling epoll_wait.
+
+ 2. The sysfs parameter or per-NAPI config parameters ``gro_flush_timeout``
+ and ``napi_defer_hard_irqs`` can be set to low values. They will be used
+ to defer IRQs after busy poll has found no data.
+
+ 3. The ``prefer_busy_poll`` flag must be set to true. This can be done using
+ the ``EPIOCSPARAMS`` ioctl as described above.
+
+ 4. The application uses epoll as described above to trigger NAPI packet
+ processing.
+
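Steps 3 and 4 above can be sketched in userspace C roughly as follows. This is an illustrative sketch, not text from the patch. The helper names ``fill_busy_poll_params``, ``apply_busy_poll_params`` and ``poll_once`` are hypothetical, as are any values passed for ``busy_poll_usecs`` and ``busy_poll_budget``. The ``#ifndef`` fallback assumes the ``struct epoll_params`` layout and ioctl numbers of the Linux UAPI header, for C libraries whose ``<sys/epoll.h>`` does not provide them yet.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

/*
 * Fallback for C libraries whose <sys/epoll.h> predates the epoll ioctls.
 * Layout and ioctl numbers are assumed to mirror <linux/eventpoll.h>.
 */
#ifndef EPIOCSPARAMS
struct epoll_params {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t prefer_busy_poll;
	uint8_t __pad;		/* keep this byte zero */
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#define EPIOCGPARAMS _IOR(EPOLL_IOC_TYPE, 0x02, struct epoll_params)
#endif

/* Step 3: build parameters with prefer_busy_poll set to true. */
void fill_busy_poll_params(struct epoll_params *p, uint32_t usecs,
			   uint16_t budget)
{
	memset(p, 0, sizeof(*p));	/* also zeroes the padding byte */
	p->busy_poll_usecs = usecs;
	p->busy_poll_budget = budget;
	p->prefer_busy_poll = 1;
}

/* Apply the parameters to one epoll context; 0 on success, -1 with errno. */
int apply_busy_poll_params(int epfd, const struct epoll_params *p)
{
	return ioctl(epfd, EPIOCSPARAMS, p);
}

/*
 * Step 4: one iteration of the application loop. A call that returns
 * events keeps IRQs suspended; a call that returns 0 hands control back
 * to gro_flush_timeout and napi_defer_hard_irqs. handle_event is a
 * hypothetical application callback, and maxevents bounds how much work
 * is taken per call.
 */
int poll_once(int epfd, struct epoll_event *evs, int maxevents,
	      int timeout_ms,
	      void (*handle_event)(const struct epoll_event *))
{
	int n = epoll_wait(epfd, evs, maxevents, timeout_ms);

	for (int i = 0; i < n; i++)
		handle_event(&evs[i]);
	return n;
}
```

An application would call ``apply_busy_poll_params`` once after ``epoll_create1`` and then call ``poll_once`` in a loop. ``irq-suspend-timeout`` should be sized to cover the time one such iteration can take with the chosen ``maxevents``.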
+As mentioned above, as long as subsequent calls to epoll_wait return events to
+userland, the ``irq-suspend-timeout`` is deferred and IRQs are disabled. This
+allows the application to process data without interference.
+
+Once a call to epoll_wait results in no events being found, IRQ suspension is
+automatically disabled and the ``gro_flush_timeout`` and
+``napi_defer_hard_irqs`` mitigation mechanisms take over.
+
+It is expected that ``irq-suspend-timeout`` will be set to a value much larger
+than ``gro_flush_timeout`` as ``irq-suspend-timeout`` should suspend IRQs for
+the duration of one userland processing cycle.
+
+While it is not strictly necessary to use ``napi_defer_hard_irqs`` and
+``gro_flush_timeout`` to use IRQ suspension, their use is strongly
+recommended.
+
+IRQ suspension causes the system to alternate between polling mode and
+IRQ-driven packet delivery. During busy periods, ``irq-suspend-timeout``
+overrides ``gro_flush_timeout`` and keeps the system busy polling, but when
+epoll finds no events, the settings of ``gro_flush_timeout`` and
+``napi_defer_hard_irqs`` determine the next step.
+
+There are essentially three possible loops for network processing and
+packet delivery:
+
+1) hardirq -> softirq -> napi poll; basic interrupt delivery
+2) timer -> softirq -> napi poll; deferred irq processing
+3) epoll -> busy-poll -> napi poll; busy looping
+
+Loop 2 can take control from Loop 1, if ``gro_flush_timeout`` and
+``napi_defer_hard_irqs`` are set.
+
+If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are set, Loops 2
+and 3 "wrestle" with each other for control.
+
+During busy periods, ``irq-suspend-timeout`` is used as the timer in Loop 2,
+which essentially tilts network processing in favour of Loop 3.
+
+If ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` are not set, Loop 3
+cannot take control from Loop 1.
+
+Therefore, setting ``gro_flush_timeout`` and ``napi_defer_hard_irqs`` is
+the recommended usage, because otherwise setting ``irq-suspend-timeout``
+might not have any discernible effect.
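
The interplay of the three timeouts described above can be summarised with a small model. This is a deliberately simplified sketch of the prose, not kernel code and not part of the patch. ``struct napi_settings`` and ``next_timer_ns`` are hypothetical names, and the function only answers one question: which timer value bounds the next IRQ-free window after a call to epoll_wait.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the configured values (0 means "not set"). */
struct napi_settings {
	uint64_t irq_suspend_timeout_ns;
	uint64_t gro_flush_timeout_ns;
	uint32_t napi_defer_hard_irqs;
};

/*
 * Return the timer value (in nanoseconds) that applies after an
 * epoll_wait() call which handed events_returned events to userland.
 * A return value of 0 means no timer is armed and plain interrupt
 * delivery (Loop 1) resumes.
 */
uint64_t next_timer_ns(const struct napi_settings *s, bool prefer_busy_poll,
		       int events_returned)
{
	/*
	 * Busy period: events were found, so IRQs stay suspended and
	 * irq-suspend-timeout acts as the safety timer (favouring Loop 3).
	 */
	if (prefer_busy_poll && s->irq_suspend_timeout_ns &&
	    events_returned > 0)
		return s->irq_suspend_timeout_ns;

	/* Idle: suspension ends and the deferral settings take over (Loop 2). */
	if (s->napi_defer_hard_irqs > 0)
		return s->gro_flush_timeout_ns;

	/* Nothing configured: back to interrupt-driven delivery (Loop 1). */
	return 0;
}
```

With all three values set, the model returns the large ``irq-suspend-timeout`` while events keep arriving and falls back to the small ``gro_flush_timeout`` once a call finds nothing, which is the alternation the text describes.
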
.. _threaded:
diff --git a/Documentation/networking/net_cachelines/inet_connection_sock.rst b/Documentation/networking/net_cachelines/inet_connection_sock.rst
index 7a911dc95652..4a15627fc93b 100644
--- a/Documentation/networking/net_cachelines/inet_connection_sock.rst
+++ b/Documentation/networking/net_cachelines/inet_connection_sock.rst
@@ -5,46 +5,48 @@
inet_connection_sock struct fast path usage breakdown
=====================================================
+=================================== ====================== =================== =================== ========================================================================================================================================================
Type Name fastpath_tx_access fastpath_rx_access comment
-..struct ..inet_connection_sock
-struct_inet_sock icsk_inet read_mostly read_mostly tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
-struct_request_sock_queue icsk_accept_queue - -
-struct_inet_bind_bucket icsk_bind_hash read_mostly - tcp_set_state
-struct_inet_bind2_bucket icsk_bind2_hash read_mostly - tcp_set_state,inet_put_port
-unsigned_long icsk_timeout read_mostly - inet_csk_reset_xmit_timer,tcp_connect
-struct_timer_list icsk_retransmit_timer read_mostly - inet_csk_reset_xmit_timer,tcp_connect
-struct_timer_list icsk_delack_timer read_mostly - inet_csk_reset_xmit_timer,tcp_connect
-u32 icsk_rto read_write - tcp_cwnd_validate,tcp_schedule_loss_probe,tcp_connect_init,tcp_connect,tcp_write_xmit,tcp_push_one
-u32 icsk_rto_min - -
-u32 icsk_delack_max - -
-u32 icsk_pmtu_cookie read_write - tcp_sync_mss,tcp_current_mss,tcp_send_syn_data,tcp_connect_init,tcp_connect
-struct_tcp_congestion_ops icsk_ca_ops read_write - tcp_cwnd_validate,tcp_tso_segs,tcp_ca_dst_init,tcp_connect_init,tcp_connect,tcp_write_xmit
-struct_inet_connection_sock_af_ops icsk_af_ops read_mostly - tcp_finish_connect,tcp_send_syn_data,tcp_mtup_init,tcp_mtu_check_reprobe,tcp_mtu_probe,tcp_connect_init,tcp_connect,__tcp_transmit_skb
-struct_tcp_ulp_ops* icsk_ulp_ops - -
-void* icsk_ulp_data - -
-u8:5 icsk_ca_state read_write - tcp_cwnd_application_limited,tcp_set_ca_state,tcp_enter_cwr,tcp_tso_should_defer,tcp_mtu_probe,tcp_schedule_loss_probe,tcp_write_xmit,__tcp_transmit_skb
-u8:1 icsk_ca_initialized read_write - tcp_init_transfer,tcp_init_congestion_control,tcp_init_transfer,tcp_finish_connect,tcp_connect
-u8:1 icsk_ca_setsockopt - -
-u8:1 icsk_ca_dst_locked write_mostly - tcp_ca_dst_init,tcp_connect_init,tcp_connect
-u8 icsk_retransmits write_mostly - tcp_connect_init,tcp_connect
-u8 icsk_pending read_write - inet_csk_reset_xmit_timer,tcp_connect,tcp_check_probe_timer,__tcp_push_pending_frames,tcp_rearm_rto,tcp_event_new_data_sent,tcp_event_new_data_sent
-u8 icsk_backoff write_mostly - tcp_write_queue_purge,tcp_connect_init
-u8 icsk_syn_retries - -
-u8 icsk_probes_out - -
-u16 icsk_ext_hdr_len read_mostly - __tcp_mtu_to_mss,tcp_mtu_to_rss,tcp_mtu_probe,tcp_write_xmit,tcp_mtu_to_mss,
-struct_icsk_ack_u8 pending read_write read_write inet_csk_ack_scheduled,__tcp_cleanup_rbuf,tcp_cleanup_rbuf,inet_csk_clear_xmit_timer,tcp_event_ack-sent,inet_csk_reset_xmit_timer
-struct_icsk_ack_u8 quick read_write write_mostly tcp_dec_quickack_mode,tcp_event_ack_sent,__tcp_transmit_skb,__tcp_select_window,__tcp_cleanup_rbuf
-struct_icsk_ack_u8 pingpong - -
-struct_icsk_ack_u8 retry write_mostly read_write inet_csk_clear_xmit_timer,tcp_rearm_rto,tcp_event_new_data_sent,tcp_write_xmit,__tcp_send_ack,tcp_send_ack,
-struct_icsk_ack_u8 ato read_mostly write_mostly tcp_dec_quickack_mode,tcp_event_ack_sent,__tcp_transmit_skb,__tcp_send_ack,tcp_send_ack
-struct_icsk_ack_unsigned_long timeout read_write read_write inet_csk_reset_xmit_timer,tcp_connect
-struct_icsk_ack_u32 lrcvtime read_write - tcp_finish_connect,tcp_connect,tcp_event_data_sent,__tcp_transmit_skb
-struct_icsk_ack_u16 rcv_mss write_mostly read_mostly __tcp_select_window,__tcp_cleanup_rbuf,tcp_initialize_rcv_mss,tcp_connect_init
-struct_icsk_mtup_int search_high read_write - tcp_mtup_init,tcp_sync_mss,tcp_connect_init,tcp_mtu_check_reprobe,tcp_write_xmit
-struct_icsk_mtup_int search_low read_write - tcp_mtu_probe,tcp_mtu_check_reprobe,tcp_write_xmit,tcp_sync_mss,tcp_connect_init,tcp_mtup_init
-struct_icsk_mtup_u32:31 probe_size read_write - tcp_mtup_init,tcp_connect_init,__tcp_transmit_skb
-struct_icsk_mtup_u32:1 enabled read_write - tcp_mtup_init,tcp_sync_mss,tcp_connect_init,tcp_mtu_probe,tcp_write_xmit
-struct_icsk_mtup_u32 probe_timestamp read_write - tcp_mtup_init,tcp_connect_init,tcp_mtu_check_reprobe,tcp_mtu_probe
-u32 icsk_probes_tstamp - -
-u32 icsk_user_timeout - -
-u64[104/sizeof(u64)] icsk_ca_priv - -
+=================================== ====================== =================== =================== ========================================================================================================================================================
+struct inet_sock icsk_inet read_mostly read_mostly tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
+struct request_sock_queue icsk_accept_queue
+struct inet_bind_bucket icsk_bind_hash read_mostly tcp_set_state
+struct inet_bind2_bucket icsk_bind2_hash read_mostly tcp_set_state,inet_put_port
+unsigned_long icsk_timeout read_mostly inet_csk_reset_xmit_timer,tcp_connect
+struct timer_list icsk_retransmit_timer read_mostly inet_csk_reset_xmit_timer,tcp_connect
+struct timer_list icsk_delack_timer read_mostly inet_csk_reset_xmit_timer,tcp_connect
+u32 icsk_rto read_write tcp_cwnd_validate,tcp_schedule_loss_probe,tcp_connect_init,tcp_connect,tcp_write_xmit,tcp_push_one
+u32 icsk_rto_min
+u32 icsk_delack_max
+u32 icsk_pmtu_cookie read_write tcp_sync_mss,tcp_current_mss,tcp_send_syn_data,tcp_connect_init,tcp_connect
+struct tcp_congestion_ops icsk_ca_ops read_write tcp_cwnd_validate,tcp_tso_segs,tcp_ca_dst_init,tcp_connect_init,tcp_connect,tcp_write_xmit
+struct inet_connection_sock_af_ops icsk_af_ops read_mostly tcp_finish_connect,tcp_send_syn_data,tcp_mtup_init,tcp_mtu_check_reprobe,tcp_mtu_probe,tcp_connect_init,tcp_connect,__tcp_transmit_skb
+struct tcp_ulp_ops* icsk_ulp_ops
+void* icsk_ulp_data
+u8:5 icsk_ca_state read_write tcp_cwnd_application_limited,tcp_set_ca_state,tcp_enter_cwr,tcp_tso_should_defer,tcp_mtu_probe,tcp_schedule_loss_probe,tcp_write_xmit,__tcp_transmit_skb
+u8:1 icsk_ca_initialized read_write tcp_init_transfer,tcp_init_congestion_control,tcp_finish_connect,tcp_connect
+u8:1 icsk_ca_setsockopt
+u8:1 icsk_ca_dst_locked write_mostly tcp_ca_dst_init,tcp_connect_init,tcp_connect
+u8 icsk_retransmits write_mostly tcp_connect_init,tcp_connect
+u8 icsk_pending read_write inet_csk_reset_xmit_timer,tcp_connect,tcp_check_probe_timer,__tcp_push_pending_frames,tcp_rearm_rto,tcp_event_new_data_sent
+u8 icsk_backoff write_mostly tcp_write_queue_purge,tcp_connect_init
+u8 icsk_syn_retries
+u8 icsk_probes_out
+u16 icsk_ext_hdr_len read_mostly __tcp_mtu_to_mss,tcp_mtu_to_rss,tcp_mtu_probe,tcp_write_xmit,tcp_mtu_to_mss,
+struct icsk_ack_u8 pending read_write read_write inet_csk_ack_scheduled,__tcp_cleanup_rbuf,tcp_cleanup_rbuf,inet_csk_clear_xmit_timer,tcp_event_ack_sent,inet_csk_reset_xmit_timer
+struct icsk_ack_u8 quick read_write write_mostly tcp_dec_quickack_mode,tcp_event_ack_sent,__tcp_transmit_skb,__tcp_select_window,__tcp_cleanup_rbuf
+struct icsk_ack_u8 pingpong
+struct icsk_ack_u8 retry write_mostly read_write inet_csk_clear_xmit_timer,tcp_rearm_rto,tcp_event_new_data_sent,tcp_write_xmit,__tcp_send_ack,tcp_send_ack
+struct icsk_ack_u8 ato read_mostly write_mostly tcp_dec_quickack_mode,tcp_event_ack_sent,__tcp_transmit_skb,__tcp_send_ack,tcp_send_ack
+struct icsk_ack_unsigned_long timeout read_write read_write inet_csk_reset_xmit_timer,tcp_connect
+struct icsk_ack_u32 lrcvtime read_write tcp_finish_connect,tcp_connect,tcp_event_data_sent,__tcp_transmit_skb
+struct icsk_ack_u16 rcv_mss write_mostly read_mostly __tcp_select_window,__tcp_cleanup_rbuf,tcp_initialize_rcv_mss,tcp_connect_init
+struct icsk_mtup_int search_high read_write tcp_mtup_init,tcp_sync_mss,tcp_connect_init,tcp_mtu_check_reprobe,tcp_write_xmit
+struct icsk_mtup_int search_low read_write tcp_mtu_probe,tcp_mtu_check_reprobe,tcp_write_xmit,tcp_sync_mss,tcp_connect_init,tcp_mtup_init
+struct icsk_mtup_u32:31 probe_size read_write tcp_mtup_init,tcp_connect_init,__tcp_transmit_skb
+struct icsk_mtup_u32:1 enabled read_write tcp_mtup_init,tcp_sync_mss,tcp_connect_init,tcp_mtu_probe,tcp_write_xmit
+struct icsk_mtup_u32 probe_timestamp read_write tcp_mtup_init,tcp_connect_init,tcp_mtu_check_reprobe,tcp_mtu_probe
+u32 icsk_probes_tstamp
+u32 icsk_user_timeout
+u64[104/sizeof(u64)] icsk_ca_priv
+=================================== ====================== =================== =================== ========================================================================================================================================================
diff --git a/Documentation/networking/net_cachelines/inet_sock.rst b/Documentation/networking/net_cachelines/inet_sock.rst
index 595d7ef5fc8b..b11bf48fa2b3 100644
--- a/Documentation/networking/net_cachelines/inet_sock.rst
+++ b/Documentation/networking/net_cachelines/inet_sock.rst
@@ -5,40 +5,42 @@
inet_sock struct fast path usage breakdown
==========================================
+======================= ===================== =================== =================== ======================================================================================================
Type Name fastpath_tx_access fastpath_rx_access comment
-..struct ..inet_sock
-struct_sock sk read_mostly read_mostly tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
-struct_ipv6_pinfo* pinet6 - -
-be16 inet_sport read_mostly - __tcp_transmit_skb
-be32 inet_daddr read_mostly - ip_select_ident_segs
-be32 inet_rcv_saddr - -
-be16 inet_dport read_mostly - __tcp_transmit_skb
-u16 inet_num - -
-be32 inet_saddr - -
-s16 uc_ttl read_mostly - __ip_queue_xmit/ip_select_ttl
-u16 cmsg_flags - -
-struct_ip_options_rcu* inet_opt read_mostly - __ip_queue_xmit
-u16 inet_id read_mostly - ip_select_ident_segs
-u8 tos read_mostly - ip_queue_xmit
-u8 min_ttl - -
-u8 mc_ttl - -
-u8 pmtudisc - -
-u8:1 recverr - -
-u8:1 is_icsk - -
-u8:1 freebind - -
-u8:1 hdrincl - -
-u8:1 mc_loop - -
-u8:1 transparent - -
-u8:1 mc_all - -
-u8:1 nodefrag - -
-u8:1 bind_address_no_port - -
-u8:1 recverr_rfc4884 - -
-u8:1 defer_connect read_mostly - tcp_sendmsg_fastopen
-u8 rcv_tos - -
-u8 convert_csum - -
-int uc_index - -
-int mc_index - -
-be32 mc_addr - -
-struct_ip_mc_socklist* mc_list - -
-struct_inet_cork_full cork read_mostly - __tcp_transmit_skb
-struct local_port_range - -
+======================= ===================== =================== =================== ======================================================================================================
+struct sock sk read_mostly read_mostly tcp_init_buffer_space,tcp_init_transfer,tcp_finish_connect,tcp_connect,tcp_send_rcvq,tcp_send_syn_data
+struct ipv6_pinfo* pinet6
+be16 inet_sport read_mostly __tcp_transmit_skb
+be32 inet_daddr read_mostly ip_select_ident_segs
+be32 inet_rcv_saddr
+be16 inet_dport read_mostly __tcp_transmit_skb
+u16 inet_num
+be32 inet_saddr
+s16 uc_ttl read_mostly __ip_queue_xmit/ip_select_ttl
+u16 cmsg_flags
+struct ip_options_rcu* inet_opt read_mostly __ip_queue_xmit
+u16 inet_id read_mostly ip_select_ident_segs
+u8 tos read_mostly ip_queue_xmit
+u8 min_ttl
+u8 mc_ttl
+u8 pmtudisc
+u8:1 recverr
+u8:1 is_icsk
+u8:1 freebind
+u8:1 hdrincl
+u8:1 mc_loop
+u8:1 transparent
+u8:1 mc_all
+u8:1 nodefrag
+u8:1 bind_address_no_port
+u8:1 recverr_rfc4884
+u8:1 defer_connect read_mostly tcp_sendmsg_fastopen
+u8 rcv_tos
+u8 convert_csum
+int uc_index
+int mc_index
+be32 mc_addr
+struct ip_mc_socklist* mc_list
+struct inet_cork_full cork read_mostly __tcp_transmit_skb
+struct local_port_range
+======================= ===================== =================== =================== ======================================================================================================
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 70c4fb9d4e5c..15e31ece675f 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -5,174 +5,188 @@
net_device struct fast path usage breakdown
===========================================
-Type Name fastpath_tx_access fastpath_rx_access Comments
-..struct ..net_device
-char name[16] - -
-struct_netdev_name_node* name_node
-struct_dev_ifalias* ifalias
-unsigned_long mem_end
-unsigned_long mem_start
-unsigned_long base_addr
-unsigned_long state read_mostly read_mostly netif_running(dev)
-struct_list_head dev_list
-struct_list_head napi_list
-struct_list_head unreg_list
-struct_list_head close_list
-struct_list_head ptype_all read_mostly - dev_nit_active(tx)
-struct_list_head ptype_specific read_mostly deliver_ptype_list_skb/__netif_receive_skb_core(rx)
-struct adj_list
-unsigned_int flags read_mostly read_mostly __dev_queue_xmit,__dev_xmit_skb,ip6_output,__ip6_finish_output(tx);ip6_rcv_core(rx)
-xdp_features_t xdp_features
-unsigned_long_long priv_flags read_mostly - __dev_queue_xmit(tx)
-struct_net_device_ops* netdev_ops read_mostly - netdev_core_pick_tx,netdev_start_xmit(tx)
-struct_xdp_metadata_ops* xdp_metadata_ops
-int ifindex - read_mostly ip6_rcv_core
-unsigned_short gflags
-unsigned_short hard_header_len read_mostly read_mostly ip6_xmit(tx);gro_list_prepare(rx)
-unsigned_int mtu read_mostly - ip_finish_output2
-unsigned_short needed_headroom read_mostly - LL_RESERVED_SPACE/ip_finish_output2
-unsigned_short needed_tailroom
-netdev_features_t features read_mostly read_mostly HARD_TX_LOCK,netif_skb_features,sk_setup_caps(tx);netif_elide_gro(rx)
-netdev_features_t hw_features
-netdev_features_t wanted_features
-netdev_features_t vlan_features
-netdev_features_t hw_enc_features - - netif_skb_features
-netdev_features_t mpls_features
-netdev_features_t gso_partial_features read_mostly gso_features_check
-unsigned_int min_mtu
-unsigned_int max_mtu
-unsigned_short type
-unsigned_char min_header_len
-unsigned_char name_assign_type
-int group
-struct_net_device_stats stats
-struct_net_device_core_stats* core_stats
-atomic_t carrier_up_count
-atomic_t carrier_down_count
-struct_iw_handler_def* wireless_handlers
-struct_iw_public_data* wireless_data
-struct_ethtool_ops* ethtool_ops
-struct_l3mdev_ops* l3mdev_ops
-struct_ndisc_ops* ndisc_ops
-struct_xfrmdev_ops* xfrmdev_ops
-struct_tlsdev_ops* tlsdev_ops
-struct_header_ops* header_ops read_mostly - ip_finish_output2,ip6_finish_output2(tx)
-unsigned_char operstate
-unsigned_char link_mode
-unsigned_char if_port
-unsigned_char dma
-unsigned_char perm_addr[32]
-unsigned_char addr_assign_type
-unsigned_char addr_len
-unsigned_char upper_level
-unsigned_char lower_level
-unsigned_short neigh_priv_len
-unsigned_short padded
-unsigned_short dev_id
-unsigned_short dev_port
-spinlock_t addr_list_lock
-int irq
-struct_netdev_hw_addr_list uc
-struct_netdev_hw_addr_list mc
-struct_netdev_hw_addr_list dev_addrs
-struct_kset* queues_kset
-struct_list_head unlink_list
-unsigned_int promiscuity
-unsigned_int allmulti
-bool uc_promisc
-unsigned_char nested_level
-struct_in_device* ip_ptr read_mostly read_mostly __in_dev_get
-struct_inet6_dev* ip6_ptr read_mostly read_mostly __in6_dev_get
-struct_vlan_info* vlan_info
-struct_dsa_port* dsa_ptr
-struct_tipc_bearer* tipc_ptr
-void* atalk_ptr
-void* ax25_ptr
-struct_wireless_dev* ieee80211_ptr
-struct_wpan_dev* ieee802154_ptr
-struct_mpls_dev* mpls_ptr
-struct_mctp_dev* mctp_ptr
-unsigned_char* dev_addr
-struct_netdev_queue* _rx read_mostly - netdev_get_rx_queue(rx)
-unsigned_int num_rx_queues
-unsigned_int real_num_rx_queues - read_mostly get_rps_cpu
-struct_bpf_prog* xdp_prog - read_mostly netif_elide_gro()
-unsigned_long gro_flush_timeout - read_mostly napi_complete_done
-int napi_defer_hard_irqs - read_mostly napi_complete_done
-unsigned_int gro_max_size - read_mostly skb_gro_receive
-unsigned_int gro_ipv4_max_size - read_mostly skb_gro_receive
-rx_handler_func_t* rx_handler read_mostly - __netif_receive_skb_core
-void* rx_handler_data read_mostly -
-struct_netdev_queue* ingress_queue read_mostly -
-struct_bpf_mprog_entry tcx_ingress - read_mostly sch_handle_ingress
-struct_nf_hook_entries* nf_hooks_ingress
-unsigned_char broadcast[32]
-struct_cpu_rmap* rx_cpu_rmap
-struct_hlist_node index_hlist
-struct_netdev_queue* _tx read_mostly - netdev_get_tx_queue(tx)
-unsigned_int num_tx_queues - -
-unsigned_int real_num_tx_queues read_mostly - skb_tx_hash,netdev_core_pick_tx(tx)
-unsigned_int tx_queue_len
-spinlock_t tx_global_lock
-struct_xdp_dev_bulk_queue__percpu* xdp_bulkq
-struct_xps_dev_maps* xps_maps[2] read_mostly - __netif_set_xps_queue
-struct_bpf_mprog_entry tcx_egress read_mostly - sch_handle_egress
-struct_nf_hook_entries* nf_hooks_egress read_mostly -
-struct_hlist_head qdisc_hash[16]
-struct_timer_list watchdog_timer
-int watchdog_timeo
-u32 proto_down_reason
-struct_list_head todo_list
-int__percpu* pcpu_refcnt
-refcount_t dev_refcnt
-struct_ref_tracker_dir refcnt_tracker
-struct_list_head link_watch_list
-enum:8 reg_state
-bool dismantle
-enum:16 rtnl_link_state
-bool needs_free_netdev
-void*priv_destructor struct_net_device
-struct_netpoll_info* npinfo - read_mostly napi_poll/napi_poll_lock
-possible_net_t nd_net - read_mostly (dev_net)napi_busy_loop,tcp_v(4/6)_rcv,ip(v6)_rcv,ip(6)_input,ip(6)_input_finish
-void* ml_priv
-enum_netdev_ml_priv_type ml_priv_type
-struct_pcpu_lstats__percpu* lstats read_mostly dev_lstats_add()
-struct_pcpu_sw_netstats__percpu* tstats read_mostly dev_sw_netstats_tx_add()
-struct_pcpu_dstats__percpu* dstats
-struct_garp_port* garp_port
-struct_mrp_port* mrp_port
-struct_dm_hw_stat_delta* dm_private
-struct_device dev - -
-struct_attribute_group* sysfs_groups[4]
-struct_attribute_group* sysfs_rx_queue_group
-struct_rtnl_link_ops* rtnl_link_ops
-unsigned_int gso_max_size read_mostly - sk_dst_gso_max_size
-unsigned_int tso_max_size
-u16 gso_max_segs read_mostly - gso_max_segs
-u16 tso_max_segs
-unsigned_int gso_ipv4_max_size read_mostly - sk_dst_gso_max_size
-struct_dcbnl_rtnl_ops* dcbnl_ops
-s16 num_tc read_mostly - skb_tx_hash
-struct_netdev_tc_txq tc_to_txq[16] read_mostly - skb_tx_hash
-u8 prio_tc_map[16]
-unsigned_int fcoe_ddp_xid
-struct_netprio_map* priomap
-struct_phy_device* phydev
-struct_sfp_bus* sfp_bus
-struct_lock_class_key* qdisc_tx_busylock
-bool proto_down
-unsigned:1 wol_enabled
-unsigned:1 threaded - - napi_poll(napi_enable,dev_set_threaded)
-struct_list_head net_notifier_list
-struct_macsec_ops* macsec_ops
-struct_udp_tunnel_nic_info* udp_tunnel_nic_info
-struct_udp_tunnel_nic* udp_tunnel_nic
-unsigned_int xdp_zc_max_segs
-struct_bpf_xdp_entity xdp_state[3]
-u8 dev_addr_shadow[32]
-netdevice_tracker linkwatch_dev_tracker
-netdevice_tracker watchdog_dev_tracker
-netdevice_tracker dev_registered_tracker
-struct_rtnl_hw_stats64* offload_xstats_l3
-struct_devlink_port* devlink_port
-struct_dpll_pin* dpll_pin
+=================================== =========================== =================== =================== ===================================================================================
+Type Name fastpath_tx_access fastpath_rx_access Comments
+=================================== =========================== =================== =================== ===================================================================================
+unsigned_long:32 priv_flags read_mostly __dev_queue_xmit(tx)
+unsigned_long:1 lltx read_mostly HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx)
+char name[16]
+struct netdev_name_node* name_node
+struct dev_ifalias* ifalias
+unsigned_long mem_end
+unsigned_long mem_start
+unsigned_long base_addr
+unsigned_long state read_mostly read_mostly netif_running(dev)
+struct list_head dev_list
+struct list_head napi_list
+struct list_head unreg_list
+struct list_head close_list
+struct list_head ptype_all read_mostly dev_nit_active(tx)
+struct list_head ptype_specific read_mostly deliver_ptype_list_skb/__netif_receive_skb_core(rx)
+struct adj_list
+unsigned_int flags read_mostly read_mostly __dev_queue_xmit,__dev_xmit_skb,ip6_output,__ip6_finish_output(tx);ip6_rcv_core(rx)
+xdp_features_t xdp_features
+struct net_device_ops* netdev_ops read_mostly netdev_core_pick_tx,netdev_start_xmit(tx)
+struct xdp_metadata_ops* xdp_metadata_ops
+int ifindex read_mostly ip6_rcv_core
+unsigned_short gflags
+unsigned_short hard_header_len read_mostly read_mostly ip6_xmit(tx);gro_list_prepare(rx)
+unsigned_int mtu read_mostly ip_finish_output2
+unsigned_short needed_headroom read_mostly LL_RESERVED_SPACE/ip_finish_output2
+unsigned_short needed_tailroom
+netdev_features_t features read_mostly read_mostly HARD_TX_LOCK,netif_skb_features,sk_setup_caps(tx);netif_elide_gro(rx)
+netdev_features_t hw_features
+netdev_features_t wanted_features
+netdev_features_t vlan_features
+netdev_features_t hw_enc_features netif_skb_features
+netdev_features_t mpls_features
+netdev_features_t gso_partial_features read_mostly gso_features_check
+unsigned_int min_mtu
+unsigned_int max_mtu
+unsigned_short type
+unsigned_char min_header_len
+unsigned_char name_assign_type
+int group
+struct net_device_stats stats
+struct net_device_core_stats* core_stats
+atomic_t carrier_up_count
+atomic_t carrier_down_count
+struct iw_handler_def* wireless_handlers
+struct ethtool_ops* ethtool_ops
+struct l3mdev_ops* l3mdev_ops
+struct ndisc_ops* ndisc_ops
+struct xfrmdev_ops* xfrmdev_ops
+struct tlsdev_ops* tlsdev_ops
+struct header_ops* header_ops read_mostly ip_finish_output2,ip6_finish_output2(tx)
+unsigned_char operstate
+unsigned_char link_mode
+unsigned_char if_port
+unsigned_char dma
+unsigned_char perm_addr[32]
+unsigned_char addr_assign_type
+unsigned_char addr_len
+unsigned_char upper_level
+unsigned_char lower_level
+unsigned_short neigh_priv_len
+unsigned_short padded
+unsigned_short dev_id
+unsigned_short dev_port
+spinlock_t addr_list_lock
+int irq
+struct netdev_hw_addr_list uc
+struct netdev_hw_addr_list mc
+struct netdev_hw_addr_list dev_addrs
+struct kset* queues_kset
+struct list_head unlink_list
+unsigned_int promiscuity
+unsigned_int allmulti
+bool uc_promisc
+unsigned_char nested_level
+struct in_device* ip_ptr read_mostly read_mostly __in_dev_get
+struct hlist_head fib_nh_head
+struct inet6_dev* ip6_ptr read_mostly read_mostly __in6_dev_get
+struct vlan_info* vlan_info
+struct dsa_port* dsa_ptr
+struct tipc_bearer* tipc_ptr
+void* atalk_ptr
+void* ax25_ptr
+struct wireless_dev* ieee80211_ptr
+struct wpan_dev* ieee802154_ptr
+struct mpls_dev* mpls_ptr
+struct mctp_dev* mctp_ptr
+unsigned_char* dev_addr
+struct netdev_queue* _rx read_mostly netdev_get_rx_queue(rx)
+unsigned_int num_rx_queues
+unsigned_int real_num_rx_queues read_mostly get_rps_cpu
+struct bpf_prog* xdp_prog read_mostly netif_elide_gro()
+unsigned_long gro_flush_timeout read_mostly napi_complete_done
+u32 napi_defer_hard_irqs read_mostly napi_complete_done
+unsigned_int gro_max_size read_mostly skb_gro_receive
+unsigned_int gro_ipv4_max_size read_mostly skb_gro_receive
+rx_handler_func_t* rx_handler read_mostly __netif_receive_skb_core
+void* rx_handler_data read_mostly
+struct netdev_queue* ingress_queue read_mostly
+struct bpf_mprog_entry tcx_ingress read_mostly sch_handle_ingress
+struct nf_hook_entries* nf_hooks_ingress
+unsigned_char broadcast[32]
+struct cpu_rmap* rx_cpu_rmap
+struct hlist_node index_hlist
+struct netdev_queue* _tx read_mostly netdev_get_tx_queue(tx)
+unsigned_int num_tx_queues
+unsigned_int real_num_tx_queues read_mostly skb_tx_hash,netdev_core_pick_tx(tx)
+unsigned_int tx_queue_len
+spinlock_t tx_global_lock
+struct xdp_dev_bulk_queue__percpu* xdp_bulkq
+struct xps_dev_maps* xps_maps[2] read_mostly __netif_set_xps_queue
+struct bpf_mprog_entry tcx_egress read_mostly sch_handle_egress
+struct nf_hook_entries* nf_hooks_egress read_mostly
+struct hlist_head qdisc_hash[16]
+struct timer_list watchdog_timer
+int watchdog_timeo
+u32 proto_down_reason
+struct list_head todo_list
+int__percpu* pcpu_refcnt
+refcount_t dev_refcnt
+struct ref_tracker_dir refcnt_tracker
+struct list_head link_watch_list
+enum:8 reg_state
+bool dismantle
+enum:16 rtnl_link_state
+bool needs_free_netdev
+void*priv_destructor struct net_device
+struct netpoll_info* npinfo read_mostly napi_poll/napi_poll_lock
+possible_net_t nd_net read_mostly (dev_net)napi_busy_loop,tcp_v(4/6)_rcv,ip(v6)_rcv,ip(6)_input,ip(6)_input_finish
+void* ml_priv
+enum_netdev_ml_priv_type ml_priv_type
+struct pcpu_lstats__percpu* lstats read_mostly dev_lstats_add()
+struct pcpu_sw_netstats__percpu* tstats read_mostly dev_sw_netstats_tx_add()
+struct pcpu_dstats__percpu* dstats
+struct garp_port* garp_port
+struct mrp_port* mrp_port
+struct dm_hw_stat_delta* dm_private
+struct device dev
+struct attribute_group* sysfs_groups[4]
+struct attribute_group* sysfs_rx_queue_group
+struct rtnl_link_ops* rtnl_link_ops
+unsigned_int gso_max_size read_mostly sk_dst_gso_max_size
+unsigned_int tso_max_size
+u16 gso_max_segs read_mostly gso_max_segs
+u16 tso_max_segs
+unsigned_int gso_ipv4_max_size read_mostly sk_dst_gso_max_size
+struct dcbnl_rtnl_ops* dcbnl_ops
+s16 num_tc read_mostly skb_tx_hash
+struct netdev_tc_txq tc_to_txq[16] read_mostly skb_tx_hash
+u8 prio_tc_map[16]
+unsigned_int fcoe_ddp_xid
+struct netprio_map* priomap
+struct phy_device* phydev
+struct sfp_bus* sfp_bus
+struct lock_class_key* qdisc_tx_busylock
+bool proto_down
+unsigned:1 wol_enabled
+unsigned:1 threaded napi_poll(napi_enable,dev_set_threaded)
+unsigned_long:1 see_all_hwtstamp_requests
+unsigned_long:1 change_proto_down
+unsigned_long:1 netns_local
+unsigned_long:1 fcoe_mtu
+struct list_head net_notifier_list
+struct macsec_ops* macsec_ops
+struct udp_tunnel_nic_info* udp_tunnel_nic_info
+struct udp_tunnel_nic* udp_tunnel_nic
+unsigned_int xdp_zc_max_segs
+struct bpf_xdp_entity xdp_state[3]
+u8 dev_addr_shadow[32]
+netdevice_tracker linkwatch_dev_tracker
+netdevice_tracker watchdog_dev_tracker
+netdevice_tracker dev_registered_tracker
+struct rtnl_hw_stats64* offload_xstats_l3
+struct devlink_port* devlink_port
+struct dpll_pin* dpll_pin
+struct hlist_head page_pools
+struct dim_irq_moder* irq_moder
+u64 max_pacing_offload_horizon
+struct napi_config* napi_config
+unsigned_long gro_flush_timeout
+u32 napi_defer_hard_irqs
+struct hlist_head neighbours[2]
+=================================== =========================== =================== =================== ===================================================================================
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index 9b87089a84c6..de0263302f16 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -5,154 +5,157 @@
netns_ipv4 struct fast path usage breakdown
===========================================
+=============================== ============================================ =================== =================== =================================================
Type Name fastpath_tx_access fastpath_rx_access comment
-..struct ..netns_ipv4
-struct_inet_timewait_death_row tcp_death_row
-struct_udp_table* udp_table
-struct_ctl_table_header* forw_hdr
-struct_ctl_table_header* frags_hdr
-struct_ctl_table_header* ipv4_hdr
-struct_ctl_table_header* route_hdr
-struct_ctl_table_header* xfrm4_hdr
-struct_ipv4_devconf* devconf_all
-struct_ipv4_devconf* devconf_dflt
-struct_ip_ra_chain ra_chain
-struct_mutex ra_mutex
-struct_fib_rules_ops* rules_ops
-struct_fib_table fib_main
-struct_fib_table fib_default
-unsigned_int fib_rules_require_fldissect
-bool fib_has_custom_rules
-bool fib_has_custom_local_routes
-bool fib_offload_disabled
-atomic_t fib_num_tclassid_users
-struct_hlist_head* fib_table_hash
-struct_sock* fibnl
-struct_sock* mc_autojoin_sk
-struct_inet_peer_base* peers
-struct_fqdir* fqdir
-u8 sysctl_icmp_echo_ignore_all
-u8 sysctl_icmp_echo_enable_probe
-u8 sysctl_icmp_echo_ignore_broadcasts
-u8 sysctl_icmp_ignore_bogus_error_responses
-u8 sysctl_icmp_errors_use_inbound_ifaddr
-int sysctl_icmp_ratelimit
-int sysctl_icmp_ratemask
-u32 ip_rt_min_pmtu - -
-int ip_rt_mtu_expires - -
-int ip_rt_min_advmss - -
-struct_local_ports ip_local_ports - -
-u8 sysctl_tcp_ecn - -
-u8 sysctl_tcp_ecn_fallback - -
-u8 sysctl_ip_default_ttl - - ip4_dst_hoplimit/ip_select_ttl
-u8 sysctl_ip_no_pmtu_disc - -
-u8 sysctl_ip_fwd_use_pmtu read_mostly - ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
-u8 sysctl_ip_fwd_update_priority - - ip_forward
-u8 sysctl_ip_nonlocal_bind - -
-u8 sysctl_ip_autobind_reuse - -
-u8 sysctl_ip_dynaddr - -
-u8 sysctl_ip_early_demux - read_mostly ip(6)_rcv_finish_core
-u8 sysctl_raw_l3mdev_accept - -
-u8 sysctl_tcp_early_demux - read_mostly ip(6)_rcv_finish_core
-u8 sysctl_udp_early_demux
-u8 sysctl_nexthop_compat_mode - -
-u8 sysctl_fwmark_reflect - -
-u8 sysctl_tcp_fwmark_accept - -
-u8 sysctl_tcp_l3mdev_accept - -
-u8 sysctl_tcp_mtu_probing - -
-int sysctl_tcp_mtu_probe_floor - -
-int sysctl_tcp_base_mss - -
-int sysctl_tcp_min_snd_mss read_mostly - __tcp_mtu_to_mss(tcp_write_xmit)
-int sysctl_tcp_probe_threshold - - tcp_mtu_probe(tcp_write_xmit)
-u32 sysctl_tcp_probe_interval - - tcp_mtu_check_reprobe(tcp_write_xmit)
-int sysctl_tcp_keepalive_time - -
-int sysctl_tcp_keepalive_intvl - -
-u8 sysctl_tcp_keepalive_probes - -
-u8 sysctl_tcp_syn_retries - -
-u8 sysctl_tcp_synack_retries - -
-u8 sysctl_tcp_syncookies - - generated_on_syn
-u8 sysctl_tcp_migrate_req - - reuseport
-u8 sysctl_tcp_comp_sack_nr - - __tcp_ack_snd_check
-int sysctl_tcp_reordering - read_mostly tcp_may_raise_cwnd/tcp_cong_control
-u8 sysctl_tcp_retries1 - -
-u8 sysctl_tcp_retries2 - -
-u8 sysctl_tcp_orphan_retries - -
-u8 sysctl_tcp_tw_reuse - - timewait_sock_ops
-int sysctl_tcp_fin_timeout - - TCP_LAST_ACK/tcp_rcv_state_process
-unsigned_int sysctl_tcp_notsent_lowat read_mostly - tcp_notsent_lowat/tcp_stream_memory_free
-u8 sysctl_tcp_sack - - tcp_syn_options
-u8 sysctl_tcp_window_scaling - - tcp_syn_options,tcp_parse_options
-u8 sysctl_tcp_timestamps
-u8 sysctl_tcp_early_retrans read_mostly - tcp_schedule_loss_probe(tcp_write_xmit)
-u8 sysctl_tcp_recovery - - tcp_fastretrans_alert
-u8 sysctl_tcp_thin_linear_timeouts - - tcp_retrans_timer(on_thin_streams)
-u8 sysctl_tcp_slow_start_after_idle - - unlikely(tcp_cwnd_validate-network-not-starved)
-u8 sysctl_tcp_retrans_collapse - -
-u8 sysctl_tcp_stdurg - - unlikely(tcp_check_urg)
-u8 sysctl_tcp_rfc1337 - -
-u8 sysctl_tcp_abort_on_overflow - -
-u8 sysctl_tcp_fack - -
-int sysctl_tcp_max_reordering - - tcp_check_sack_reordering
-int sysctl_tcp_adv_win_scale - - tcp_init_buffer_space
-u8 sysctl_tcp_dsack - - partial_packet_or_retrans_in_tcp_data_queue
-u8 sysctl_tcp_app_win - - tcp_win_from_space
-u8 sysctl_tcp_frto - - tcp_enter_loss
-u8 sysctl_tcp_nometrics_save - - TCP_LAST_ACK/tcp_update_metrics
-u8 sysctl_tcp_no_ssthresh_metrics_save - - TCP_LAST_ACK/tcp_(update/init)_metrics
+=============================== ============================================ =================== =================== =================================================
+struct_inet_timewait_death_row tcp_death_row
+struct_udp_table* udp_table
+struct_ctl_table_header* forw_hdr
+struct_ctl_table_header* frags_hdr
+struct_ctl_table_header* ipv4_hdr
+struct_ctl_table_header* route_hdr
+struct_ctl_table_header* xfrm4_hdr
+struct_ipv4_devconf* devconf_all
+struct_ipv4_devconf* devconf_dflt
+struct_ip_ra_chain ra_chain
+struct_mutex ra_mutex
+struct_fib_rules_ops* rules_ops
+struct_fib_table fib_main
+struct_fib_table fib_default
+unsigned_int fib_rules_require_fldissect
+bool fib_has_custom_rules
+bool fib_has_custom_local_routes
+bool fib_offload_disabled
+atomic_t fib_num_tclassid_users
+struct_hlist_head* fib_table_hash
+struct_sock* fibnl
+struct_sock* mc_autojoin_sk
+struct_inet_peer_base* peers
+struct_fqdir* fqdir
+u8 sysctl_icmp_echo_ignore_all
+u8 sysctl_icmp_echo_enable_probe
+u8 sysctl_icmp_echo_ignore_broadcasts
+u8 sysctl_icmp_ignore_bogus_error_responses
+u8 sysctl_icmp_errors_use_inbound_ifaddr
+int sysctl_icmp_ratelimit
+int sysctl_icmp_ratemask
+u32 ip_rt_min_pmtu
+int ip_rt_mtu_expires
+int ip_rt_min_advmss
+struct_local_ports ip_local_ports
+u8 sysctl_tcp_ecn
+u8 sysctl_tcp_ecn_fallback
+u8 sysctl_ip_default_ttl ip4_dst_hoplimit/ip_select_ttl
+u8 sysctl_ip_no_pmtu_disc
+u8 sysctl_ip_fwd_use_pmtu read_mostly ip_dst_mtu_maybe_forward/ip_skb_dst_mtu
+u8 sysctl_ip_fwd_update_priority ip_forward
+u8 sysctl_ip_nonlocal_bind
+u8 sysctl_ip_autobind_reuse
+u8 sysctl_ip_dynaddr
+u8 sysctl_ip_early_demux read_mostly ip(6)_rcv_finish_core
+u8 sysctl_raw_l3mdev_accept
+u8 sysctl_tcp_early_demux read_mostly ip(6)_rcv_finish_core
+u8 sysctl_udp_early_demux
+u8 sysctl_nexthop_compat_mode
+u8 sysctl_fwmark_reflect
+u8 sysctl_tcp_fwmark_accept
+u8 sysctl_tcp_l3mdev_accept read_mostly __inet6_lookup_established/inet_request_bound_dev_if
+u8 sysctl_tcp_mtu_probing
+int sysctl_tcp_mtu_probe_floor
+int sysctl_tcp_base_mss
+int sysctl_tcp_min_snd_mss read_mostly __tcp_mtu_to_mss(tcp_write_xmit)
+int sysctl_tcp_probe_threshold tcp_mtu_probe(tcp_write_xmit)
+u32 sysctl_tcp_probe_interval tcp_mtu_check_reprobe(tcp_write_xmit)
+int sysctl_tcp_keepalive_time
+int sysctl_tcp_keepalive_intvl
+u8 sysctl_tcp_keepalive_probes
+u8 sysctl_tcp_syn_retries
+u8 sysctl_tcp_synack_retries
+u8 sysctl_tcp_syncookies generated_on_syn
+u8 sysctl_tcp_migrate_req reuseport
+u8 sysctl_tcp_comp_sack_nr __tcp_ack_snd_check
+int sysctl_tcp_reordering read_mostly tcp_may_raise_cwnd/tcp_cong_control
+u8 sysctl_tcp_retries1
+u8 sysctl_tcp_retries2
+u8 sysctl_tcp_orphan_retries
+u8 sysctl_tcp_tw_reuse timewait_sock_ops
+unsigned_int sysctl_tcp_tw_reuse_delay timewait_sock_ops
+int sysctl_tcp_fin_timeout TCP_LAST_ACK/tcp_rcv_state_process
+unsigned_int sysctl_tcp_notsent_lowat read_mostly tcp_notsent_lowat/tcp_stream_memory_free
+u8 sysctl_tcp_sack tcp_syn_options
+u8 sysctl_tcp_window_scaling tcp_syn_options,tcp_parse_options
+u8 sysctl_tcp_timestamps
+u8 sysctl_tcp_early_retrans read_mostly tcp_schedule_loss_probe(tcp_write_xmit)
+u8 sysctl_tcp_recovery tcp_fastretrans_alert
+u8 sysctl_tcp_thin_linear_timeouts tcp_retrans_timer(on_thin_streams)
+u8 sysctl_tcp_slow_start_after_idle unlikely(tcp_cwnd_validate-network-not-starved)
+u8 sysctl_tcp_retrans_collapse
+u8 sysctl_tcp_stdurg unlikely(tcp_check_urg)
+u8 sysctl_tcp_rfc1337
+u8 sysctl_tcp_abort_on_overflow
+u8 sysctl_tcp_fack
+int sysctl_tcp_max_reordering tcp_check_sack_reordering
+int sysctl_tcp_adv_win_scale tcp_init_buffer_space
+u8 sysctl_tcp_dsack partial_packet_or_retrans_in_tcp_data_queue
+u8 sysctl_tcp_app_win tcp_win_from_space
+u8 sysctl_tcp_frto tcp_enter_loss
+u8 sysctl_tcp_nometrics_save TCP_LAST_ACK/tcp_update_metrics
+u8 sysctl_tcp_no_ssthresh_metrics_save TCP_LAST_ACK/tcp_(update/init)_metrics
u8 sysctl_tcp_moderate_rcvbuf read_mostly read_mostly tcp_tso_should_defer(tx);tcp_rcv_space_adjust(rx)
-u8 sysctl_tcp_tso_win_divisor read_mostly - tcp_tso_should_defer(tcp_write_xmit)
-u8 sysctl_tcp_workaround_signed_windows - - tcp_select_window
-int sysctl_tcp_limit_output_bytes read_mostly - tcp_small_queue_check(tcp_write_xmit)
-int sysctl_tcp_challenge_ack_limit - -
-int sysctl_tcp_min_rtt_wlen read_mostly - tcp_ack_update_rtt
-u8 sysctl_tcp_min_tso_segs - - unlikely(icsk_ca_ops-written)
-u8 sysctl_tcp_tso_rtt_log read_mostly - tcp_tso_autosize
-u8 sysctl_tcp_autocorking read_mostly - tcp_push/tcp_should_autocork
-u8 sysctl_tcp_reflect_tos - - tcp_v(4/6)_send_synack
-int sysctl_tcp_invalid_ratelimit - -
-int sysctl_tcp_pacing_ss_ratio - - default_cong_cont(tcp_update_pacing_rate)
-int sysctl_tcp_pacing_ca_ratio - - default_cong_cont(tcp_update_pacing_rate)
-int sysctl_tcp_wmem[3] read_mostly - tcp_wmem_schedule(sendmsg/sendpage)
-int sysctl_tcp_rmem[3] - read_mostly __tcp_grow_window(tx),tcp_rcv_space_adjust(rx)
-unsigned_int sysctl_tcp_child_ehash_entries
-unsigned_long sysctl_tcp_comp_sack_delay_ns - - __tcp_ack_snd_check
-unsigned_long sysctl_tcp_comp_sack_slack_ns - - __tcp_ack_snd_check
-int sysctl_max_syn_backlog - -
-int sysctl_tcp_fastopen - -
-struct_tcp_congestion_ops tcp_congestion_control - - init_cc
-struct_tcp_fastopen_context tcp_fastopen_ctx - -
-unsigned_int sysctl_tcp_fastopen_blackhole_timeout - -
-atomic_t tfo_active_disable_times - -
-unsigned_long tfo_active_disable_stamp - -
-u32 tcp_challenge_timestamp - -
-u32 tcp_challenge_count - -
-u8 sysctl_tcp_plb_enabled - -
-u8 sysctl_tcp_plb_idle_rehash_rounds - -
-u8 sysctl_tcp_plb_rehash_rounds - -
-u8 sysctl_tcp_plb_suspend_rto_sec - -
-int sysctl_tcp_plb_cong_thresh - -
-int sysctl_udp_wmem_min
-int sysctl_udp_rmem_min
-u8 sysctl_fib_notify_on_flag_change
-u8 sysctl_udp_l3mdev_accept
-u8 sysctl_igmp_llm_reports
-int sysctl_igmp_max_memberships
-int sysctl_igmp_max_msf
-int sysctl_igmp_qrv
-struct_ping_group_range ping_group_range
-atomic_t dev_addr_genid
-unsigned_int sysctl_udp_child_hash_entries
-unsigned_long* sysctl_local_reserved_ports
-int sysctl_ip_prot_sock
-struct_mr_table* mrt
-struct_list_head mr_tables
-struct_fib_rules_ops* mr_rules_ops
-u32 sysctl_fib_multipath_hash_fields
-u8 sysctl_fib_multipath_use_neigh
-u8 sysctl_fib_multipath_hash_policy
-struct_fib_notifier_ops* notifier_ops
-unsigned_int fib_seq
-struct_fib_notifier_ops* ipmr_notifier_ops
-unsigned_int ipmr_seq
-atomic_t rt_genid
-siphash_key_t ip_id_key
+u8 sysctl_tcp_tso_win_divisor read_mostly tcp_tso_should_defer(tcp_write_xmit)
+u8 sysctl_tcp_workaround_signed_windows tcp_select_window
+int sysctl_tcp_limit_output_bytes read_mostly tcp_small_queue_check(tcp_write_xmit)
+int sysctl_tcp_challenge_ack_limit
+int sysctl_tcp_min_rtt_wlen read_mostly tcp_ack_update_rtt
+u8 sysctl_tcp_min_tso_segs unlikely(icsk_ca_ops-written)
+u8 sysctl_tcp_tso_rtt_log read_mostly tcp_tso_autosize
+u8 sysctl_tcp_autocorking read_mostly tcp_push/tcp_should_autocork
+u8 sysctl_tcp_reflect_tos tcp_v(4/6)_send_synack
+int sysctl_tcp_invalid_ratelimit
+int sysctl_tcp_pacing_ss_ratio default_cong_cont(tcp_update_pacing_rate)
+int sysctl_tcp_pacing_ca_ratio default_cong_cont(tcp_update_pacing_rate)
+int sysctl_tcp_wmem[3] read_mostly tcp_wmem_schedule(sendmsg/sendpage)
+int sysctl_tcp_rmem[3] read_mostly __tcp_grow_window(tx),tcp_rcv_space_adjust(rx)
+unsigned_int sysctl_tcp_child_ehash_entries
+unsigned_long sysctl_tcp_comp_sack_delay_ns __tcp_ack_snd_check
+unsigned_long sysctl_tcp_comp_sack_slack_ns __tcp_ack_snd_check
+int sysctl_max_syn_backlog
+int sysctl_tcp_fastopen
+struct_tcp_congestion_ops tcp_congestion_control init_cc
+struct_tcp_fastopen_context tcp_fastopen_ctx
+unsigned_int sysctl_tcp_fastopen_blackhole_timeout
+atomic_t tfo_active_disable_times
+unsigned_long tfo_active_disable_stamp
+u32 tcp_challenge_timestamp
+u32 tcp_challenge_count
+u8 sysctl_tcp_plb_enabled
+u8 sysctl_tcp_plb_idle_rehash_rounds
+u8 sysctl_tcp_plb_rehash_rounds
+u8 sysctl_tcp_plb_suspend_rto_sec
+int sysctl_tcp_plb_cong_thresh
+int sysctl_udp_wmem_min
+int sysctl_udp_rmem_min
+u8 sysctl_fib_notify_on_flag_change
+u8 sysctl_udp_l3mdev_accept
+u8 sysctl_igmp_llm_reports
+int sysctl_igmp_max_memberships
+int sysctl_igmp_max_msf
+int sysctl_igmp_qrv
+struct_ping_group_range ping_group_range
+atomic_t dev_addr_genid
+unsigned_int sysctl_udp_child_hash_entries
+unsigned_long* sysctl_local_reserved_ports
+int sysctl_ip_prot_sock
+struct_mr_table* mrt
+struct_list_head mr_tables
+struct_fib_rules_ops* mr_rules_ops
+u32 sysctl_fib_multipath_hash_fields
+u8 sysctl_fib_multipath_use_neigh
+u8 sysctl_fib_multipath_hash_policy
+struct_fib_notifier_ops* notifier_ops
+unsigned_int fib_seq
+struct_fib_notifier_ops* ipmr_notifier_ops
+unsigned_int ipmr_seq
+atomic_t rt_genid
+siphash_key_t ip_id_key
+=============================== ============================================ =================== =================== =================================================
diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
index 6a071538566c..90ca2d92547d 100644
--- a/Documentation/networking/net_cachelines/snmp.rst
+++ b/Documentation/networking/net_cachelines/snmp.rst
@@ -5,131 +5,133 @@
netns_ipv4 enum fast path usage breakdown
===========================================
+============== ===================================== =================== =================== ==================================================
Type Name fastpath_tx_access fastpath_rx_access comment
-..enum
-unsigned_long LINUX_MIB_TCPKEEPALIVE write_mostly - tcp_keepalive_timer
-unsigned_long LINUX_MIB_DELAYEDACKS write_mostly - tcp_delack_timer_handler,tcp_delack_timer
-unsigned_long LINUX_MIB_DELAYEDACKLOCKED write_mostly - tcp_delack_timer_handler,tcp_delack_timer
-unsigned_long LINUX_MIB_TCPAUTOCORKING write_mostly - tcp_push,tcp_sendmsg_locked
-unsigned_long LINUX_MIB_TCPFROMZEROWINDOWADV write_mostly - tcp_select_window,tcp_transmit-skb
-unsigned_long LINUX_MIB_TCPTOZEROWINDOWADV write_mostly - tcp_select_window,tcp_transmit-skb
-unsigned_long LINUX_MIB_TCPWANTZEROWINDOWADV write_mostly - tcp_select_window,tcp_transmit-skb
-unsigned_long LINUX_MIB_TCPORIGDATASENT write_mostly - tcp_write_xmit
-unsigned_long LINUX_MIB_TCPHPHITS - write_mostly tcp_rcv_established,tcp_v4_do_rcv,tcp_v6_do_rcv
-unsigned_long LINUX_MIB_TCPRCVCOALESCE - write_mostly tcp_try_coalesce,tcp_queue_rcv,tcp_rcv_established
-unsigned_long LINUX_MIB_TCPPUREACKS - write_mostly tcp_ack,tcp_rcv_established
-unsigned_long LINUX_MIB_TCPHPACKS - write_mostly tcp_ack,tcp_rcv_established
-unsigned_long LINUX_MIB_TCPDELIVERED - write_mostly tcp_newly_delivered,tcp_ack,tcp_rcv_established
-unsigned_long LINUX_MIB_SYNCOOKIESSENT
-unsigned_long LINUX_MIB_SYNCOOKIESRECV
-unsigned_long LINUX_MIB_SYNCOOKIESFAILED
-unsigned_long LINUX_MIB_EMBRYONICRSTS
-unsigned_long LINUX_MIB_PRUNECALLED
-unsigned_long LINUX_MIB_RCVPRUNED
-unsigned_long LINUX_MIB_OFOPRUNED
-unsigned_long LINUX_MIB_OUTOFWINDOWICMPS
-unsigned_long LINUX_MIB_LOCKDROPPEDICMPS
-unsigned_long LINUX_MIB_ARPFILTER
-unsigned_long LINUX_MIB_TIMEWAITED
-unsigned_long LINUX_MIB_TIMEWAITRECYCLED
-unsigned_long LINUX_MIB_TIMEWAITKILLED
-unsigned_long LINUX_MIB_PAWSACTIVEREJECTED
-unsigned_long LINUX_MIB_PAWSESTABREJECTED
-unsigned_long LINUX_MIB_DELAYEDACKLOST
-unsigned_long LINUX_MIB_LISTENOVERFLOWS
-unsigned_long LINUX_MIB_LISTENDROPS
-unsigned_long LINUX_MIB_TCPRENORECOVERY
-unsigned_long LINUX_MIB_TCPSACKRECOVERY
-unsigned_long LINUX_MIB_TCPSACKRENEGING
-unsigned_long LINUX_MIB_TCPSACKREORDER
-unsigned_long LINUX_MIB_TCPRENOREORDER
-unsigned_long LINUX_MIB_TCPTSREORDER
-unsigned_long LINUX_MIB_TCPFULLUNDO
-unsigned_long LINUX_MIB_TCPPARTIALUNDO
-unsigned_long LINUX_MIB_TCPDSACKUNDO
-unsigned_long LINUX_MIB_TCPLOSSUNDO
-unsigned_long LINUX_MIB_TCPLOSTRETRANSMIT
-unsigned_long LINUX_MIB_TCPRENOFAILURES
-unsigned_long LINUX_MIB_TCPSACKFAILURES
-unsigned_long LINUX_MIB_TCPLOSSFAILURES
-unsigned_long LINUX_MIB_TCPFASTRETRANS
-unsigned_long LINUX_MIB_TCPSLOWSTARTRETRANS
-unsigned_long LINUX_MIB_TCPTIMEOUTS
-unsigned_long LINUX_MIB_TCPLOSSPROBES
-unsigned_long LINUX_MIB_TCPLOSSPROBERECOVERY
-unsigned_long LINUX_MIB_TCPRENORECOVERYFAIL
-unsigned_long LINUX_MIB_TCPSACKRECOVERYFAIL
-unsigned_long LINUX_MIB_TCPRCVCOLLAPSED
-unsigned_long LINUX_MIB_TCPDSACKOLDSENT
-unsigned_long LINUX_MIB_TCPDSACKOFOSENT
-unsigned_long LINUX_MIB_TCPDSACKRECV
-unsigned_long LINUX_MIB_TCPDSACKOFORECV
-unsigned_long LINUX_MIB_TCPABORTONDATA
-unsigned_long LINUX_MIB_TCPABORTONCLOSE
-unsigned_long LINUX_MIB_TCPABORTONMEMORY
-unsigned_long LINUX_MIB_TCPABORTONTIMEOUT
-unsigned_long LINUX_MIB_TCPABORTONLINGER
-unsigned_long LINUX_MIB_TCPABORTFAILED
-unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
-unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
-unsigned_long LINUX_MIB_TCPSACKDISCARD
-unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
-unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
-unsigned_long LINUX_MIB_TCPSPURIOUSRTOS
-unsigned_long LINUX_MIB_TCPMD5NOTFOUND
-unsigned_long LINUX_MIB_TCPMD5UNEXPECTED
-unsigned_long LINUX_MIB_TCPMD5FAILURE
-unsigned_long LINUX_MIB_SACKSHIFTED
-unsigned_long LINUX_MIB_SACKMERGED
-unsigned_long LINUX_MIB_SACKSHIFTFALLBACK
-unsigned_long LINUX_MIB_TCPBACKLOGDROP
-unsigned_long LINUX_MIB_PFMEMALLOCDROP
-unsigned_long LINUX_MIB_TCPMINTTLDROP
-unsigned_long LINUX_MIB_TCPDEFERACCEPTDROP
-unsigned_long LINUX_MIB_IPRPFILTER
-unsigned_long LINUX_MIB_TCPTIMEWAITOVERFLOW
-unsigned_long LINUX_MIB_TCPREQQFULLDOCOOKIES
-unsigned_long LINUX_MIB_TCPREQQFULLDROP
-unsigned_long LINUX_MIB_TCPRETRANSFAIL
-unsigned_long LINUX_MIB_TCPBACKLOGCOALESCE
-unsigned_long LINUX_MIB_TCPOFOQUEUE
-unsigned_long LINUX_MIB_TCPOFODROP
-unsigned_long LINUX_MIB_TCPOFOMERGE
-unsigned_long LINUX_MIB_TCPCHALLENGEACK
-unsigned_long LINUX_MIB_TCPSYNCHALLENGE
-unsigned_long LINUX_MIB_TCPFASTOPENACTIVE
-unsigned_long LINUX_MIB_TCPFASTOPENACTIVEFAIL
-unsigned_long LINUX_MIB_TCPFASTOPENPASSIVE
-unsigned_long LINUX_MIB_TCPFASTOPENPASSIVEFAIL
-unsigned_long LINUX_MIB_TCPFASTOPENLISTENOVERFLOW
-unsigned_long LINUX_MIB_TCPFASTOPENCOOKIEREQD
-unsigned_long LINUX_MIB_TCPFASTOPENBLACKHOLE
-unsigned_long LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES
-unsigned_long LINUX_MIB_BUSYPOLLRXPACKETS
-unsigned_long LINUX_MIB_TCPSYNRETRANS
-unsigned_long LINUX_MIB_TCPHYSTARTTRAINDETECT
-unsigned_long LINUX_MIB_TCPHYSTARTTRAINCWND
-unsigned_long LINUX_MIB_TCPHYSTARTDELAYDETECT
-unsigned_long LINUX_MIB_TCPHYSTARTDELAYCWND
-unsigned_long LINUX_MIB_TCPACKSKIPPEDSYNRECV
-unsigned_long LINUX_MIB_TCPACKSKIPPEDPAWS
-unsigned_long LINUX_MIB_TCPACKSKIPPEDSEQ
-unsigned_long LINUX_MIB_TCPACKSKIPPEDFINWAIT2
-unsigned_long LINUX_MIB_TCPACKSKIPPEDTIMEWAIT
-unsigned_long LINUX_MIB_TCPACKSKIPPEDCHALLENGE
-unsigned_long LINUX_MIB_TCPWINPROBE
-unsigned_long LINUX_MIB_TCPMTUPFAIL
-unsigned_long LINUX_MIB_TCPMTUPSUCCESS
-unsigned_long LINUX_MIB_TCPDELIVEREDCE
-unsigned_long LINUX_MIB_TCPACKCOMPRESSED
-unsigned_long LINUX_MIB_TCPZEROWINDOWDROP
-unsigned_long LINUX_MIB_TCPRCVQDROP
-unsigned_long LINUX_MIB_TCPWQUEUETOOBIG
-unsigned_long LINUX_MIB_TCPFASTOPENPASSIVEALTKEY
-unsigned_long LINUX_MIB_TCPTIMEOUTREHASH
-unsigned_long LINUX_MIB_TCPDUPLICATEDATAREHASH
-unsigned_long LINUX_MIB_TCPDSACKRECVSEGS
-unsigned_long LINUX_MIB_TCPDSACKIGNOREDDUBIOUS
-unsigned_long LINUX_MIB_TCPMIGRATEREQSUCCESS
-unsigned_long LINUX_MIB_TCPMIGRATEREQFAILURE
-unsigned_long __LINUX_MIB_MAX
+============== ===================================== =================== =================== ==================================================
+unsigned_long LINUX_MIB_TCPKEEPALIVE write_mostly tcp_keepalive_timer
+unsigned_long LINUX_MIB_DELAYEDACKS write_mostly tcp_delack_timer_handler,tcp_delack_timer
+unsigned_long LINUX_MIB_DELAYEDACKLOCKED write_mostly tcp_delack_timer_handler,tcp_delack_timer
+unsigned_long LINUX_MIB_TCPAUTOCORKING write_mostly tcp_push,tcp_sendmsg_locked
+unsigned_long LINUX_MIB_TCPFROMZEROWINDOWADV write_mostly tcp_select_window,tcp_transmit-skb
+unsigned_long LINUX_MIB_TCPTOZEROWINDOWADV write_mostly tcp_select_window,tcp_transmit-skb
+unsigned_long LINUX_MIB_TCPWANTZEROWINDOWADV write_mostly tcp_select_window,tcp_transmit-skb
+unsigned_long LINUX_MIB_TCPORIGDATASENT write_mostly tcp_write_xmit
+unsigned_long LINUX_MIB_TCPHPHITS write_mostly tcp_rcv_established,tcp_v4_do_rcv,tcp_v6_do_rcv
+unsigned_long LINUX_MIB_TCPRCVCOALESCE write_mostly tcp_try_coalesce,tcp_queue_rcv,tcp_rcv_established
+unsigned_long LINUX_MIB_TCPPUREACKS write_mostly tcp_ack,tcp_rcv_established
+unsigned_long LINUX_MIB_TCPHPACKS write_mostly tcp_ack,tcp_rcv_established
+unsigned_long LINUX_MIB_TCPDELIVERED write_mostly tcp_newly_delivered,tcp_ack,tcp_rcv_established
+unsigned_long LINUX_MIB_SYNCOOKIESSENT
+unsigned_long LINUX_MIB_SYNCOOKIESRECV
+unsigned_long LINUX_MIB_SYNCOOKIESFAILED
+unsigned_long LINUX_MIB_EMBRYONICRSTS
+unsigned_long LINUX_MIB_PRUNECALLED
+unsigned_long LINUX_MIB_RCVPRUNED
+unsigned_long LINUX_MIB_OFOPRUNED
+unsigned_long LINUX_MIB_OUTOFWINDOWICMPS
+unsigned_long LINUX_MIB_LOCKDROPPEDICMPS
+unsigned_long LINUX_MIB_ARPFILTER
+unsigned_long LINUX_MIB_TIMEWAITED
+unsigned_long LINUX_MIB_TIMEWAITRECYCLED
+unsigned_long LINUX_MIB_TIMEWAITKILLED
+unsigned_long LINUX_MIB_PAWSACTIVEREJECTED
+unsigned_long LINUX_MIB_PAWSESTABREJECTED
+unsigned_long LINUX_MIB_DELAYEDACKLOST
+unsigned_long LINUX_MIB_LISTENOVERFLOWS
+unsigned_long LINUX_MIB_LISTENDROPS
+unsigned_long LINUX_MIB_TCPRENORECOVERY
+unsigned_long LINUX_MIB_TCPSACKRECOVERY
+unsigned_long LINUX_MIB_TCPSACKRENEGING
+unsigned_long LINUX_MIB_TCPSACKREORDER
+unsigned_long LINUX_MIB_TCPRENOREORDER
+unsigned_long LINUX_MIB_TCPTSREORDER
+unsigned_long LINUX_MIB_TCPFULLUNDO
+unsigned_long LINUX_MIB_TCPPARTIALUNDO
+unsigned_long LINUX_MIB_TCPDSACKUNDO
+unsigned_long LINUX_MIB_TCPLOSSUNDO
+unsigned_long LINUX_MIB_TCPLOSTRETRANSMIT
+unsigned_long LINUX_MIB_TCPRENOFAILURES
+unsigned_long LINUX_MIB_TCPSACKFAILURES
+unsigned_long LINUX_MIB_TCPLOSSFAILURES
+unsigned_long LINUX_MIB_TCPFASTRETRANS
+unsigned_long LINUX_MIB_TCPSLOWSTARTRETRANS
+unsigned_long LINUX_MIB_TCPTIMEOUTS
+unsigned_long LINUX_MIB_TCPLOSSPROBES
+unsigned_long LINUX_MIB_TCPLOSSPROBERECOVERY
+unsigned_long LINUX_MIB_TCPRENORECOVERYFAIL
+unsigned_long LINUX_MIB_TCPSACKRECOVERYFAIL
+unsigned_long LINUX_MIB_TCPRCVCOLLAPSED
+unsigned_long LINUX_MIB_TCPDSACKOLDSENT
+unsigned_long LINUX_MIB_TCPDSACKOFOSENT
+unsigned_long LINUX_MIB_TCPDSACKRECV
+unsigned_long LINUX_MIB_TCPDSACKOFORECV
+unsigned_long LINUX_MIB_TCPABORTONDATA
+unsigned_long LINUX_MIB_TCPABORTONCLOSE
+unsigned_long LINUX_MIB_TCPABORTONMEMORY
+unsigned_long LINUX_MIB_TCPABORTONTIMEOUT
+unsigned_long LINUX_MIB_TCPABORTONLINGER
+unsigned_long LINUX_MIB_TCPABORTFAILED
+unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
+unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
+unsigned_long LINUX_MIB_TCPSACKDISCARD
+unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
+unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
+unsigned_long LINUX_MIB_TCPSPURIOUSRTOS
+unsigned_long LINUX_MIB_TCPMD5NOTFOUND
+unsigned_long LINUX_MIB_TCPMD5UNEXPECTED
+unsigned_long LINUX_MIB_TCPMD5FAILURE
+unsigned_long LINUX_MIB_SACKSHIFTED
+unsigned_long LINUX_MIB_SACKMERGED
+unsigned_long LINUX_MIB_SACKSHIFTFALLBACK
+unsigned_long LINUX_MIB_TCPBACKLOGDROP
+unsigned_long LINUX_MIB_PFMEMALLOCDROP
+unsigned_long LINUX_MIB_TCPMINTTLDROP
+unsigned_long LINUX_MIB_TCPDEFERACCEPTDROP
+unsigned_long LINUX_MIB_IPRPFILTER
+unsigned_long LINUX_MIB_TCPTIMEWAITOVERFLOW
+unsigned_long LINUX_MIB_TCPREQQFULLDOCOOKIES
+unsigned_long LINUX_MIB_TCPREQQFULLDROP
+unsigned_long LINUX_MIB_TCPRETRANSFAIL
+unsigned_long LINUX_MIB_TCPBACKLOGCOALESCE
+unsigned_long LINUX_MIB_TCPOFOQUEUE
+unsigned_long LINUX_MIB_TCPOFODROP
+unsigned_long LINUX_MIB_TCPOFOMERGE
+unsigned_long LINUX_MIB_TCPCHALLENGEACK
+unsigned_long LINUX_MIB_TCPSYNCHALLENGE
+unsigned_long LINUX_MIB_TCPFASTOPENACTIVE
+unsigned_long LINUX_MIB_TCPFASTOPENACTIVEFAIL
+unsigned_long LINUX_MIB_TCPFASTOPENPASSIVE
+unsigned_long LINUX_MIB_TCPFASTOPENPASSIVEFAIL
+unsigned_long LINUX_MIB_TCPFASTOPENLISTENOVERFLOW
+unsigned_long LINUX_MIB_TCPFASTOPENCOOKIEREQD
+unsigned_long LINUX_MIB_TCPFASTOPENBLACKHOLE
+unsigned_long LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES
+unsigned_long LINUX_MIB_BUSYPOLLRXPACKETS
+unsigned_long LINUX_MIB_TCPSYNRETRANS
+unsigned_long LINUX_MIB_TCPHYSTARTTRAINDETECT
+unsigned_long LINUX_MIB_TCPHYSTARTTRAINCWND
+unsigned_long LINUX_MIB_TCPHYSTARTDELAYDETECT
+unsigned_long LINUX_MIB_TCPHYSTARTDELAYCWND
+unsigned_long LINUX_MIB_TCPACKSKIPPEDSYNRECV
+unsigned_long LINUX_MIB_TCPACKSKIPPEDPAWS
+unsigned_long LINUX_MIB_TCPACKSKIPPEDSEQ
+unsigned_long LINUX_MIB_TCPACKSKIPPEDFINWAIT2
+unsigned_long LINUX_MIB_TCPACKSKIPPEDTIMEWAIT
+unsigned_long LINUX_MIB_TCPACKSKIPPEDCHALLENGE
+unsigned_long LINUX_MIB_TCPWINPROBE
+unsigned_long LINUX_MIB_TCPMTUPFAIL
+unsigned_long LINUX_MIB_TCPMTUPSUCCESS
+unsigned_long LINUX_MIB_TCPDELIVEREDCE
+unsigned_long LINUX_MIB_TCPACKCOMPRESSED
+unsigned_long LINUX_MIB_TCPZEROWINDOWDROP
+unsigned_long LINUX_MIB_TCPRCVQDROP
+unsigned_long LINUX_MIB_TCPWQUEUETOOBIG
+unsigned_long LINUX_MIB_TCPFASTOPENPASSIVEALTKEY
+unsigned_long LINUX_MIB_TCPTIMEOUTREHASH
+unsigned_long LINUX_MIB_TCPDUPLICATEDATAREHASH
+unsigned_long LINUX_MIB_TCPDSACKRECVSEGS
+unsigned_long LINUX_MIB_TCPDSACKIGNOREDDUBIOUS
+unsigned_long LINUX_MIB_TCPMIGRATEREQSUCCESS
+unsigned_long LINUX_MIB_TCPMIGRATEREQFAILURE
+unsigned_long __LINUX_MIB_MAX
+============== ===================================== =================== =================== ==================================================
diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst
index 1c154cbd1848..1f79765072b1 100644
--- a/Documentation/networking/net_cachelines/tcp_sock.rst
+++ b/Documentation/networking/net_cachelines/tcp_sock.rst
@@ -5,153 +5,155 @@
tcp_sock struct fast path usage breakdown
=========================================
+============================= ======================= =================== =================== ==================================================================================================================================================================================================================
Type Name fastpath_tx_access fastpath_rx_access Comments
-..struct ..tcp_sock
-struct_inet_connection_sock inet_conn
+============================= ======================= =================== =================== ==================================================================================================================================================================================================================
+struct inet_connection_sock inet_conn
u16 tcp_header_len read_mostly read_mostly tcp_bound_to_half_wnd,tcp_current_mss(tx);tcp_rcv_established(rx)
-u16 gso_segs read_mostly - tcp_xmit_size_goal
+u16 gso_segs read_mostly tcp_xmit_size_goal
__be32 pred_flags read_write read_mostly tcp_select_window(tx);tcp_rcv_established(rx)
-u64 bytes_received - read_write tcp_rcv_nxt_update(rx)
-u32 segs_in - read_write tcp_v6_rcv(rx)
-u32 data_segs_in - read_write tcp_v6_rcv(rx)
+u64 bytes_received read_write tcp_rcv_nxt_update(rx)
+u32 segs_in read_write tcp_v6_rcv(rx)
+u32 data_segs_in read_write tcp_v6_rcv(rx)
u32 rcv_nxt read_mostly read_write tcp_cleanup_rbuf,tcp_send_ack,tcp_inq_hint,tcp_transmit_skb,tcp_receive_window(tx);tcp_v6_do_rcv,tcp_rcv_established,tcp_data_queue,tcp_receive_window,tcp_rcv_nxt_update(write)(rx)
-u32 copied_seq - read_mostly tcp_cleanup_rbuf,tcp_rcv_space_adjust,tcp_inq_hint
-u32 rcv_wup - read_write __tcp_cleanup_rbuf,tcp_receive_window,tcp_receive_established
+u32 copied_seq read_mostly tcp_cleanup_rbuf,tcp_rcv_space_adjust,tcp_inq_hint
+u32 rcv_wup read_write __tcp_cleanup_rbuf,tcp_receive_window,tcp_receive_established
u32 snd_nxt read_write read_mostly tcp_rate_check_app_limited,__tcp_transmit_skb,tcp_event_new_data_sent(write)(tx);tcp_rcv_established,tcp_ack,tcp_clean_rtx_queue(rx)
-u32 segs_out read_write - __tcp_transmit_skb
-u32 data_segs_out read_write - __tcp_transmit_skb,tcp_update_skb_after_send
-u64 bytes_sent read_write - __tcp_transmit_skb
-u64 bytes_acked - read_write tcp_snd_una_update/tcp_ack
-u32 dsack_dups
+u32 segs_out read_write __tcp_transmit_skb
+u32 data_segs_out read_write __tcp_transmit_skb,tcp_update_skb_after_send
+u64 bytes_sent read_write __tcp_transmit_skb
+u64 bytes_acked read_write tcp_snd_una_update/tcp_ack
+u32 dsack_dups
u32 snd_una read_mostly read_write tcp_wnd_end,tcp_urg_mode,tcp_minshall_check,tcp_cwnd_validate(tx);tcp_ack,tcp_may_update_window,tcp_clean_rtx_queue(write),tcp_ack_tstamp(rx)
-u32 snd_sml read_write - tcp_minshall_check,tcp_minshall_update
-u32 rcv_tstamp - read_mostly tcp_ack
-u32 lsndtime read_write - tcp_slow_start_after_idle_check,tcp_event_data_sent
-u32 last_oow_ack_time
-u32 compressed_ack_rcv_nxt
+u32 snd_sml read_write tcp_minshall_check,tcp_minshall_update
+u32 rcv_tstamp read_mostly tcp_ack
+u32 lsndtime read_write tcp_slow_start_after_idle_check,tcp_event_data_sent
+u32 last_oow_ack_time
+u32 compressed_ack_rcv_nxt
u32 tsoffset read_mostly read_mostly tcp_established_options(tx);tcp_fast_parse_options(rx)
-struct_list_head tsq_node - -
-struct_list_head tsorted_sent_queue read_write - tcp_update_skb_after_send
-u32 snd_wl1 - read_mostly tcp_may_update_window
+struct list_head tsq_node
+struct list_head tsorted_sent_queue read_write tcp_update_skb_after_send
+u32 snd_wl1 read_mostly tcp_may_update_window
u32 snd_wnd read_mostly read_mostly tcp_wnd_end,tcp_tso_should_defer(tx);tcp_fast_path_on(rx)
-u32 max_window read_mostly - tcp_bound_to_half_wnd,forced_push
+u32 max_window read_mostly tcp_bound_to_half_wnd,forced_push
u32 mss_cache read_mostly read_mostly tcp_rate_check_app_limited,tcp_current_mss,tcp_sync_mss,tcp_sndbuf_expand,tcp_tso_should_defer(tx);tcp_update_pacing_rate,tcp_clean_rtx_queue(rx)
u32 window_clamp read_mostly read_write tcp_rcv_space_adjust,__tcp_select_window
-u32 rcv_ssthresh read_mostly - __tcp_select_window
+u32 rcv_ssthresh read_mostly __tcp_select_window
u8 scaling_ratio read_mostly read_mostly tcp_win_from_space
-struct tcp_rack
-u16 advmss - read_mostly tcp_rcv_space_adjust
-u8 compressed_ack
-u8:2 dup_ack_counter
-u8:1 tlp_retrans
+struct tcp_rack
+u16 advmss read_mostly tcp_rcv_space_adjust
+u8 compressed_ack
+u8:2 dup_ack_counter
+u8:1 tlp_retrans
u8:1 tcp_usec_ts read_mostly read_mostly
-u32 chrono_start read_write - tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
-u32[3] chrono_stat read_write - tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
-u8:2 chrono_type read_write - tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
-u8:1 rate_app_limited - read_write tcp_rate_gen
-u8:1 fastopen_connect
-u8:1 fastopen_no_cookie
-u8:1 is_sack_reneg - read_mostly tcp_skb_entail,tcp_ack
-u8:2 fastopen_client_fail
-u8:4 nonagle read_write - tcp_skb_entail,tcp_push_pending_frames
-u8:1 thin_lto
-u8:1 recvmsg_inq
-u8:1 repair read_mostly - tcp_write_xmit
-u8:1 frto
-u8 repair_queue - -
-u8:2 save_syn
-u8:1 syn_data
-u8:1 syn_fastopen
-u8:1 syn_fastopen_exp
-u8:1 syn_fastopen_ch
-u8:1 syn_data_acked
-u8:1 is_cwnd_limited read_mostly - tcp_cwnd_validate,tcp_is_cwnd_limited
-u32 tlp_high_seq - read_mostly tcp_ack
-u32 tcp_tx_delay
-u64 tcp_wstamp_ns read_write - tcp_pacing_check,tcp_tso_should_defer,tcp_update_skb_after_send
+u32 chrono_start read_write tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u32[3] chrono_stat read_write tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u8:2 chrono_type read_write tcp_chrono_start/stop(tcp_write_xmit,tcp_cwnd_validate,tcp_send_syn_data)
+u8:1 rate_app_limited read_write tcp_rate_gen
+u8:1 fastopen_connect
+u8:1 fastopen_no_cookie
+u8:1 is_sack_reneg read_mostly tcp_skb_entail,tcp_ack
+u8:2 fastopen_client_fail
+u8:4 nonagle read_write tcp_skb_entail,tcp_push_pending_frames
+u8:1 thin_lto
+u8:1 recvmsg_inq
+u8:1 repair read_mostly tcp_write_xmit
+u8:1 frto
+u8 repair_queue
+u8:2 save_syn
+u8:1 syn_data
+u8:1 syn_fastopen
+u8:1 syn_fastopen_exp
+u8:1 syn_fastopen_ch
+u8:1 syn_data_acked
+u8:1 is_cwnd_limited read_mostly tcp_cwnd_validate,tcp_is_cwnd_limited
+u32 tlp_high_seq read_mostly tcp_ack
+u32 tcp_tx_delay
+u64 tcp_wstamp_ns read_write tcp_pacing_check,tcp_tso_should_defer,tcp_update_skb_after_send
u64 tcp_clock_cache read_write read_write tcp_mstamp_refresh(tcp_write_xmit/tcp_rcv_space_adjust),__tcp_transmit_skb,tcp_tso_should_defer;timer
u64 tcp_mstamp read_write read_write tcp_mstamp_refresh(tcp_write_xmit/tcp_rcv_space_adjust)(tx);tcp_rcv_space_adjust,tcp_rate_gen,tcp_clean_rtx_queue,tcp_ack_update_rtt/tcp_time_stamp(rx);timer
u32 srtt_us read_mostly read_write tcp_tso_should_defer(tx);tcp_update_pacing_rate,__tcp_set_rto,tcp_rtt_estimator(rx)
-u32 mdev_us read_write - tcp_rtt_estimator
-u32 mdev_max_us
-u32 rttvar_us - read_mostly __tcp_set_rto
+u32 mdev_us read_write tcp_rtt_estimator
+u32 mdev_max_us
+u32 rttvar_us read_mostly __tcp_set_rto
u32 rtt_seq read_write tcp_rtt_estimator
-struct_minmax rtt_min - read_mostly tcp_min_rtt/tcp_rate_gen,tcp_min_rtttcp_update_rtt_min
+struct minmax rtt_min read_mostly tcp_min_rtt/tcp_rate_gen,tcp_min_rtttcp_update_rtt_min
u32 packets_out read_write read_write tcp_packets_in_flight(tx/rx);tcp_slow_start_after_idle_check,tcp_nagle_check,tcp_rate_skb_sent,tcp_event_new_data_sent,tcp_cwnd_validate,tcp_write_xmit(tx);tcp_ack,tcp_clean_rtx_queue,tcp_update_pacing_rate(rx)
-u32 retrans_out - read_mostly tcp_packets_in_flight,tcp_rate_check_app_limited
-u32 max_packets_out - read_write tcp_cwnd_validate
-u32 cwnd_usage_seq - read_write tcp_cwnd_validate
-u16 urg_data - read_mostly tcp_fast_path_check
-u8 ecn_flags read_write - tcp_ecn_send
-u8 keepalive_probes
-u32 reordering read_mostly - tcp_sndbuf_expand
-u32 reord_seen
+u32 retrans_out read_mostly tcp_packets_in_flight,tcp_rate_check_app_limited
+u32 max_packets_out read_write tcp_cwnd_validate
+u32 cwnd_usage_seq read_write tcp_cwnd_validate
+u16 urg_data read_mostly tcp_fast_path_check
+u8 ecn_flags read_write tcp_ecn_send
+u8 keepalive_probes
+u32 reordering read_mostly tcp_sndbuf_expand
+u32 reord_seen
u32 snd_up read_write read_mostly tcp_mark_urg,tcp_urg_mode,__tcp_transmit_skb(tx);tcp_clean_rtx_queue(rx)
-struct_tcp_options_received rx_opt read_mostly read_write tcp_established_options(tx);tcp_fast_path_on,tcp_ack_update_window,tcp_is_sack,tcp_data_queue,tcp_rcv_established,tcp_ack_update_rtt(rx)
-u32 snd_ssthresh - read_mostly tcp_update_pacing_rate
+struct tcp_options_received rx_opt read_mostly read_write tcp_established_options(tx);tcp_fast_path_on,tcp_ack_update_window,tcp_is_sack,tcp_data_queue,tcp_rcv_established,tcp_ack_update_rtt(rx)
+u32 snd_ssthresh read_mostly tcp_update_pacing_rate
u32 snd_cwnd read_mostly read_mostly tcp_snd_cwnd,tcp_rate_check_app_limited,tcp_tso_should_defer(tx);tcp_update_pacing_rate
-u32 snd_cwnd_cnt
-u32 snd_cwnd_clamp
-u32 snd_cwnd_used
-u32 snd_cwnd_stamp
-u32 prior_cwnd
-u32 prr_delivered
+u32 snd_cwnd_cnt
+u32 snd_cwnd_clamp
+u32 snd_cwnd_used
+u32 snd_cwnd_stamp
+u32 prior_cwnd
+u32 prr_delivered
u32 prr_out read_mostly read_mostly tcp_rate_skb_sent,tcp_newly_delivered(tx);tcp_ack,tcp_rate_gen,tcp_clean_rtx_queue(rx)
u32 delivered read_mostly read_write tcp_rate_skb_sent, tcp_newly_delivered(tx);tcp_ack, tcp_rate_gen, tcp_clean_rtx_queue (rx)
u32 delivered_ce read_mostly read_write tcp_rate_skb_sent(tx);tcp_rate_gen(rx)
-u32 lost - read_mostly tcp_ack
+u32 lost read_mostly tcp_ack
u32 app_limited read_write read_mostly tcp_rate_check_app_limited,tcp_rate_skb_sent(tx);tcp_rate_gen(rx)
-u64 first_tx_mstamp read_write - tcp_rate_skb_sent
-u64 delivered_mstamp read_write - tcp_rate_skb_sent
-u32 rate_delivered - read_mostly tcp_rate_gen
-u32 rate_interval_us - read_mostly rate_delivered,rate_app_limited
+u64 first_tx_mstamp read_write tcp_rate_skb_sent
+u64 delivered_mstamp read_write tcp_rate_skb_sent
+u32 rate_delivered read_mostly tcp_rate_gen
+u32 rate_interval_us read_mostly rate_delivered,rate_app_limited
u32 rcv_wnd read_write read_mostly tcp_select_window,tcp_receive_window,tcp_fast_path_check
-u32 write_seq read_write - tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
-u32 notsent_lowat read_mostly - tcp_stream_memory_free
-u32 pushed_seq read_write - tcp_mark_push,forced_push
+u32 write_seq read_write tcp_rate_check_app_limited,tcp_write_queue_empty,tcp_skb_entail,forced_push,tcp_mark_push
+u32 notsent_lowat read_mostly tcp_stream_memory_free
+u32 pushed_seq read_write tcp_mark_push,forced_push
u32 lost_out read_mostly read_mostly tcp_left_out(tx);tcp_packets_in_flight(tx/rx);tcp_rate_check_app_limited(rx)
u32 sacked_out read_mostly read_mostly tcp_left_out(tx);tcp_packets_in_flight(tx/rx);tcp_clean_rtx_queue(rx)
-struct_hrtimer pacing_timer
-struct_hrtimer compressed_ack_timer
-struct_sk_buff* lost_skb_hint read_mostly tcp_clean_rtx_queue
-struct_sk_buff* retransmit_skb_hint read_mostly - tcp_clean_rtx_queue
-struct_rb_root out_of_order_queue - read_mostly tcp_data_queue,tcp_fast_path_check
-struct_sk_buff* ooo_last_skb
-struct_tcp_sack_block[1] duplicate_sack
-struct_tcp_sack_block[4] selective_acks
-struct_tcp_sack_block[4] recv_sack_cache
-struct_sk_buff* highest_sack read_write - tcp_event_new_data_sent
-int lost_cnt_hint
-u32 prior_ssthresh
-u32 high_seq
-u32 retrans_stamp
-u32 undo_marker
-int undo_retrans
-u64 bytes_retrans
-u32 total_retrans
-u32 rto_stamp
-u16 total_rto
-u16 total_rto_recoveries
-u32 total_rto_time
-u32 urg_seq - -
-unsigned_int keepalive_time
-unsigned_int keepalive_intvl
-int linger2
-u8 bpf_sock_ops_cb_flags
-u8:1 bpf_chg_cc_inprogress
-u16 timeout_rehash
-u32 rcv_ooopack
-u32 rcv_rtt_last_tsecr
-struct rcv_rtt_est - read_write tcp_rcv_space_adjust,tcp_rcv_established
-struct rcvq_space - read_write tcp_rcv_space_adjust
-struct mtu_probe
-u32 plb_rehash
-u32 mtu_info
-bool is_mptcp
-bool smc_hs_congested
-bool syn_smc
-struct_tcp_sock_af_ops* af_specific
-struct_tcp_md5sig_info* md5sig_info
-struct_tcp_fastopen_request* fastopen_req
-struct_request_sock* fastopen_rsk
-struct_saved_syn* saved_syn
\ No newline at end of file
+struct hrtimer pacing_timer
+struct hrtimer compressed_ack_timer
+struct sk_buff* lost_skb_hint read_mostly tcp_clean_rtx_queue
+struct sk_buff* retransmit_skb_hint read_mostly tcp_clean_rtx_queue
+struct rb_root out_of_order_queue read_mostly tcp_data_queue,tcp_fast_path_check
+struct sk_buff* ooo_last_skb
+struct tcp_sack_block[1] duplicate_sack
+struct tcp_sack_block[4] selective_acks
+struct tcp_sack_block[4] recv_sack_cache
+struct sk_buff* highest_sack read_write tcp_event_new_data_sent
+int lost_cnt_hint
+u32 prior_ssthresh
+u32 high_seq
+u32 retrans_stamp
+u32 undo_marker
+int undo_retrans
+u64 bytes_retrans
+u32 total_retrans
+u32 rto_stamp
+u16 total_rto
+u16 total_rto_recoveries
+u32 total_rto_time
+u32 urg_seq
+unsigned_int keepalive_time
+unsigned_int keepalive_intvl
+int linger2
+u8 bpf_sock_ops_cb_flags
+u8:1 bpf_chg_cc_inprogress
+u16 timeout_rehash
+u32 rcv_ooopack
+u32 rcv_rtt_last_tsecr
+struct rcv_rtt_est read_write tcp_rcv_space_adjust,tcp_rcv_established
+struct rcvq_space read_write tcp_rcv_space_adjust
+struct mtu_probe
+u32 plb_rehash
+u32 mtu_info
+bool is_mptcp
+bool smc_hs_congested
+bool syn_smc
+struct tcp_sock_af_ops* af_specific
+struct tcp_md5sig_info* md5sig_info
+struct tcp_fastopen_request* fastopen_req
+struct request_sock* fastopen_rsk
+struct saved_syn* saved_syn
+============================= ======================= =================== =================== ==================================================================================================================================================================================================================
diff --git a/Documentation/networking/net_dim.rst b/Documentation/networking/net_dim.rst
index 3bed9fd95336..4377998e6826 100644
--- a/Documentation/networking/net_dim.rst
+++ b/Documentation/networking/net_dim.rst
@@ -156,7 +156,7 @@ usage is not complete but it should make the outline of the usage clear.
my_entity->bytes,
&dim_sample);
/* Call net DIM */
- net_dim(&my_entity->dim, dim_sample);
+ net_dim(&my_entity->dim, &dim_sample);
...
}
@@ -169,6 +169,48 @@ usage is not complete but it should make the outline of the usage clear.
...
}
+
+Tuning DIM
+==========
+
+Net DIM serves a range of network devices and can improve their performance
+considerably. However, the preset DIM profiles do not suit every device's
+specifications, and a mismatch between a profile and the device has been
+identified as one cause of suboptimal performance on DIM-enabled network
+devices.
+
+To address this issue, Net DIM introduces a per-device control to modify and
+access a device's ``rx-profile`` and ``tx-profile`` parameters.
+Assume that the target network device is named ethx, and that ethx only
+declares support for RX profile setting and for modifying the ``usec`` and
+``pkts`` fields (see the data structure
+:c:type:`struct dim_cq_moder <dim_cq_moder>`).
+
+You can use ethtool to modify the current RX DIM profile where all
+values are 64::
+
+ $ ethtool -C ethx rx-profile 1,1,n_2,2,n_3,n,n_n,4,n_n,n,n
+
+``n`` means do not modify this field, and ``_`` separates structure
+elements of the profile array.
+
+Query the current profiles using::
+
+ $ ethtool -c ethx
+ ...
+ rx-profile:
+ {.usec = 1, .pkts = 1, .comps = n/a,},
+ {.usec = 2, .pkts = 2, .comps = n/a,},
+ {.usec = 3, .pkts = 64, .comps = n/a,},
+ {.usec = 64, .pkts = 4, .comps = n/a,},
+ {.usec = 64, .pkts = 64, .comps = n/a,}
+ tx-profile: n/a
+
+If the network device does not support specific fields of DIM profiles,
+``n/a`` is displayed for those fields. Attempting to modify an ``n/a``
+field reports an error.
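The profile string syntax described above (``,`` between fields, ``_`` between array elements, ``n`` to keep a field) can be illustrated with a small userspace sketch. This is not the ethtool or kernel implementation; ``struct profile_entry`` and ``apply_profile_string()`` are hypothetical names, and the five-entry array size is assumed from the example output.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define DIM_PROFILE_ENTRIES 5	/* assumed from the example output above */

/* Hypothetical stand-in for one element of a DIM profile array. */
struct profile_entry {
	unsigned int usec;
	unsigned int pkts;
	unsigned int comps;
};

/*
 * Apply a string such as "1,1,n_2,2,n_3,n,n_n,4,n_n,n,n" to a profile.
 * Fields are separated by ',', array elements by '_', and 'n' leaves the
 * current value untouched. Returns 0 on success, -1 on malformed input.
 */
static int apply_profile_string(const char *s, struct profile_entry *p,
				size_t n)
{
	assert(s && p);

	for (size_t i = 0; i < n; i++) {
		unsigned int *fields[3] = { &p[i].usec, &p[i].pkts, &p[i].comps };

		for (int f = 0; f < 3; f++) {
			char sep;

			if (*s == 'n') {
				s++;			/* keep the current value */
			} else if (isdigit((unsigned char)*s)) {
				char *end;

				*fields[f] = (unsigned int)strtoul(s, &end, 10);
				s = end;
			} else {
				return -1;
			}

			/* ',' inside an element, '_' between elements, end of string last */
			sep = (f < 2) ? ',' : (i + 1 < n ? '_' : '\0');
			if (*s != sep)
				return -1;
			if (sep)
				s++;
		}
	}
	return 0;
}
```

Applied to a profile whose values are all 64, the example string yields the ``usec`` and ``pkts`` values shown in the ``ethtool -c`` output above.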
+
+
Dynamic Interrupt Moderation (DIM) library API
==============================================
diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst
index d55c2a22ec7a..94c4680fdf3e 100644
--- a/Documentation/networking/netconsole.rst
+++ b/Documentation/networking/netconsole.rst
@@ -124,7 +124,7 @@ To remove a target::
The interface exposes these parameters of a netconsole target to userspace:
- ============== ================================= ============
+ =============== ================================= ============
enabled Is this target currently enabled? (read-write)
extended Extended mode enabled (read-write)
release Prepend kernel release to message (read-write)
@@ -135,7 +135,8 @@ The interface exposes these parameters of a netconsole target to userspace:
remote_ip Remote agent's IP address (read-write)
local_mac Local interface's MAC address (read-only)
remote_mac Remote agent's MAC address (read-write)
- ============== ================================= ============
+ transmit_errors Number of packet send errors (read-only)
+ =============== ================================= ============
The "enabled" attribute is also used to control whether the parameters of
a target can be updated or not -- you can modify the parameters of only
diff --git a/Documentation/networking/netdev-features.rst b/Documentation/networking/netdev-features.rst
index d7b15bb64deb..5014f7cc1398 100644
--- a/Documentation/networking/netdev-features.rst
+++ b/Documentation/networking/netdev-features.rst
@@ -139,21 +139,6 @@ chained skbs (skb->next/prev list).
Features contained in NETIF_F_SOFT_FEATURES are features of networking
stack. Driver should not change behaviour based on them.
- * LLTX driver (deprecated for hardware drivers)
-
-NETIF_F_LLTX is meant to be used by drivers that don't need locking at all,
-e.g. software tunnels.
-
-This is also used in a few legacy drivers that implement their
-own locking, don't use it for new (hardware) drivers.
-
- * netns-local device
-
-NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between
-network namespaces (e.g. loopback).
-
-Don't use it in drivers.
-
* VLAN challenged
NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst
index c2476917a6c3..1d37038e9fbe 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -258,11 +258,11 @@ ndo_get_stats:
ndo_start_xmit:
Synchronization: __netif_tx_lock spinlock.
- When the driver sets NETIF_F_LLTX in dev->features this will be
+ When the driver sets dev->lltx this will be
called without holding netif_tx_lock. In this case the driver
has to lock by itself when needed.
The locking there should also properly protect against
- set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
+ set_rx_mode. WARNING: use of dev->lltx is deprecated.
Don't use it for new drivers.
Context: Process with BHs disabled or BH (timer),
@@ -297,3 +297,13 @@ napi->poll:
Context:
softirq
will be called with interrupts disabled by netconsole.
+
+NETDEV_INTERNAL symbol namespace
+================================
+
+Symbols exported as NETDEV_INTERNAL can only be used in networking
+core and drivers which exclusively flow via the main networking list and trees.
+Note that the inverse is not true: most symbols outside of NETDEV_INTERNAL
+are not expected to be used by random code outside netdev either.
+Symbols may lack the designation because they predate the namespaces,
+or simply due to an oversight.
diff --git a/Documentation/networking/netlink_spec/readme.txt b/Documentation/networking/netlink_spec/readme.txt
index 6763f99d216c..030b44aca4e6 100644
--- a/Documentation/networking/netlink_spec/readme.txt
+++ b/Documentation/networking/netlink_spec/readme.txt
@@ -1,4 +1,4 @@
SPDX-License-Identifier: GPL-2.0
This file is populated during the build of the documentation (htmldocs) by the
-tools/net/ynl/ynl-gen-rst.py script.
+tools/net/ynl/pyynl/ynl_gen_rst.py script.
diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst
new file mode 100644
index 000000000000..7de21ddb5412
--- /dev/null
+++ b/Documentation/networking/netmem.rst
@@ -0,0 +1,79 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Netmem Support for Network Drivers
+==================================
+
+This document outlines the requirements for network drivers to support netmem,
+an abstract memory type that enables features like device memory TCP. By
+supporting netmem, drivers can work with various underlying memory types
+with little to no modification.
+
+Benefits of Netmem:
+
+* Flexibility: Netmem can be backed by different memory types (e.g., struct
+ page, DMA-buf), allowing drivers to support various use cases such as device
+ memory TCP.
+* Future-proof: Drivers with netmem support are ready for upcoming
+ features that rely on it.
+* Simplified Development: Drivers interact with a consistent API,
+ regardless of the underlying memory implementation.
+
+Driver Requirements
+===================
+
+1. The driver must support page_pool.
+
+2. The driver must support the tcp-data-split ethtool option.
+
+3. The driver must use the page_pool netmem APIs for payload memory. The netmem
+ APIs currently 1-to-1 correspond with page APIs. Conversion to netmem should
+ be achievable by switching the page APIs to netmem APIs and tracking memory
+ via netmem_refs in the driver rather than struct page *:
+
+ - page_pool_alloc -> page_pool_alloc_netmem
+ - page_pool_get_dma_addr -> page_pool_get_dma_addr_netmem
+ - page_pool_put_page -> page_pool_put_netmem
+
+ Not all page APIs have netmem equivalents at the moment. If your driver
+ relies on a missing netmem API, feel free to add and propose to netdev@, or
+ reach out to the maintainers and/or almasrymina@google.com for help adding
+ the netmem API.
+
+4. The driver must use the following PP_FLAGS:
+
+ - PP_FLAG_DMA_MAP: netmem is not dma-mappable by the driver. The driver
+ must delegate the dma mapping to the page_pool, which knows when
+ dma-mapping is (or is not) appropriate.
+ - PP_FLAG_DMA_SYNC_DEV: netmem dma addr is not necessarily dma-syncable
+ by the driver. The driver must delegate the dma syncing to the page_pool,
+ which knows when dma-syncing is (or is not) appropriate.
+ - PP_FLAG_ALLOW_UNREADABLE_NETMEM: the driver must specify this flag iff
+ tcp-data-split is enabled.
+
+5. The driver must not assume the netmem is readable and/or backed by pages.
+ The netmem returned by the page_pool may be unreadable, in which case
+ netmem_address() will return NULL. The driver must correctly handle
+ unreadable netmem, i.e. don't attempt to handle its contents when
+ netmem_address() is NULL.
+
+ Ideally, drivers should not have to check the underlying netmem type via
+ helpers like netmem_is_net_iov() or convert the netmem to any of its
+ underlying types via netmem_to_page() or netmem_to_net_iov(). In most cases,
+ netmem or page_pool helpers that abstract this complexity are provided
+ (and more can be added).
+
+6. The driver must use page_pool_dma_sync_netmem_for_cpu() in lieu of
+ dma_sync_single_range_for_cpu(). For some memory providers, dma_syncing for
+ CPU will be done by the page_pool, for others (particularly dmabuf memory
+ provider), dma syncing for CPU is the responsibility of the userspace using
+ dmabuf APIs. The driver must delegate the entire dma-syncing operation to
+ the page_pool which will do it correctly.
+
+7. Avoid implementing driver-specific recycling on top of the page_pool. Drivers
+ cannot hold onto a struct page to do their own recycling as the netmem may
+ not be backed by a struct page. However, you may hold onto a page_pool
+ reference with page_pool_fragment_netmem() or page_pool_ref_netmem() for
+ that purpose, but be mindful that some netmem types might have longer
+ circulation times, such as when userspace holds a reference in zerocopy
+ scenarios.
diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst
index c383a394c665..238b66d0e059 100644
--- a/Documentation/networking/nf_conntrack-sysctl.rst
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -222,11 +222,11 @@ nf_flowtable_tcp_timeout - INTEGER (seconds)
Control offload timeout for tcp connections.
TCP connections may be offloaded from nf conntrack to nf flow table.
- Once aged, the connection is returned to nf conntrack with tcp pickup timeout.
+ Once aged, the connection is returned to nf conntrack.
nf_flowtable_udp_timeout - INTEGER (seconds)
default 30
Control offload timeout for udp connections.
UDP connections may be offloaded from nf conntrack to nf flow table.
- Once aged, the connection is returned to nf conntrack with udp pickup timeout.
+ Once aged, the connection is returned to nf conntrack.
diff --git a/Documentation/networking/oa-tc6-framework.rst b/Documentation/networking/oa-tc6-framework.rst
new file mode 100644
index 000000000000..fe2aabde923a
--- /dev/null
+++ b/Documentation/networking/oa-tc6-framework.rst
@@ -0,0 +1,497 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+=========================================================================
+OPEN Alliance 10BASE-T1x MAC-PHY Serial Interface (TC6) Framework Support
+=========================================================================
+
+Introduction
+------------
+
+The IEEE 802.3cg project defines two 10 Mbit/s PHYs operating over a
+single pair of conductors. The 10BASE-T1L (Clause 146) is a long reach
+PHY supporting full duplex point-to-point operation over 1 km of single
+balanced pair of conductors. The 10BASE-T1S (Clause 147) is a short reach
+PHY supporting full / half duplex point-to-point operation over 15 m of
+single balanced pair of conductors, or half duplex multidrop bus
+operation over 25 m of single balanced pair of conductors.
+
+Furthermore, the IEEE 802.3cg project defines the new Physical Layer
+Collision Avoidance (PLCA) Reconciliation Sublayer (Clause 148) meant to
+provide improved determinism to the CSMA/CD media access method. PLCA
+works in conjunction with the 10BASE-T1S PHY operating in multidrop mode.
+
+The aforementioned PHYs are intended to cover the low-speed / low-cost
+applications in industrial and automotive environments. The large number
+of pins (16) required by the MII interface, which is specified by the
+IEEE 802.3 in Clause 22, is one of the major cost factors that need to be
+addressed to fulfil this objective.
+
+The MAC-PHY solution integrates an IEEE Clause 4 MAC and a 10BASE-T1x PHY
+exposing a low pin count Serial Peripheral Interface (SPI) to the host
+microcontroller. This also enables the addition of Ethernet functionality
+to existing low-end microcontrollers which do not integrate a MAC
+controller.
+
+Overview
+--------
+
+The MAC-PHY is specified to carry both data (Ethernet frames) and control
+(register access) transactions over a single full-duplex serial peripheral
+interface.
+
+Protocol Overview
+-----------------
+
+Two types of transactions are defined in the protocol: data transactions
+for Ethernet frame transfers and control transactions for register
+read/write transfers. A chunk is the basic element of data transactions
+and is composed of 4 bytes of overhead plus 64 bytes of payload size for
+each chunk. Ethernet frames are transferred over one or more data chunks.
+Control transactions consist of one or more register read/write control
+commands.
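The chunk arithmetic in the paragraph above can be sketched as follows. The macro and function names are hypothetical, and the sketch assumes the frame starts at offset 0 of its first chunk and that no chunk is shared between frames.

```c
#include <assert.h>
#include <stddef.h>

#define TC6_CHUNK_PAYLOAD_SIZE	64u	/* payload bytes per chunk */
#define TC6_CHUNK_OVERHEAD_SIZE	4u	/* header (TX) or footer (RX) bytes */
#define TC6_CHUNK_SIZE	(TC6_CHUNK_PAYLOAD_SIZE + TC6_CHUNK_OVERHEAD_SIZE)

static_assert(TC6_CHUNK_SIZE == 68u, "a chunk is 4 + 64 bytes on the wire");

/* Number of data chunks needed to carry a frame of frame_len bytes. */
static size_t tc6_chunks_for_frame(size_t frame_len)
{
	return (frame_len + TC6_CHUNK_PAYLOAD_SIZE - 1) / TC6_CHUNK_PAYLOAD_SIZE;
}

/* Bytes clocked over SPI in one direction to move that frame. */
static size_t tc6_spi_bytes_for_frame(size_t frame_len)
{
	return tc6_chunks_for_frame(frame_len) * TC6_CHUNK_SIZE;
}
```

Under these assumptions a 1518-byte frame needs 24 chunks, or 1632 bytes on the wire.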
+
+SPI transactions are initiated by the SPI host with the assertion of CSn
+low to the MAC-PHY and end with the deassertion of CSn high. In between
+each SPI transaction, the SPI host may need time for additional
+processing and to set up the next SPI data or control transaction.
+
+SPI data transactions consist of an equal number of transmit (TX) and
+receive (RX) chunks. Chunks in both transmit and receive directions may
+or may not contain valid frame data independent from each other, allowing
+for the simultaneous transmission and reception of different length
+frames.
+
+Each transmit data chunk begins with a 32-bit data header followed by a
+data chunk payload on MOSI. The data header indicates whether transmit
+frame data is present and provides the information to determine which
+bytes of the payload contain valid frame data.
+
+In parallel, receive data chunks are received on MISO. Each receive data
+chunk consists of a data chunk payload ending with a 32-bit data footer.
+The data footer indicates if there is receive frame data present within
+the payload or not and provides the information to determine which bytes
+of the payload contain valid frame data.
+
+Reference
+---------
+
+10BASE-T1x MAC-PHY Serial Interface Specification,
+
+Link: https://opensig.org/download/document/OPEN_Alliance_10BASET1x_MAC-PHY_Serial_Interface_V1.1.pdf
+
+Hardware Architecture
+---------------------
+
+.. code-block:: none
+
+ +----------+ +-------------------------------------+
+ | | | MAC-PHY |
+ | |<---->| +-----------+ +-------+ +-------+ |
+ | SPI Host | | | SPI Slave | | MAC | | PHY | |
+ | | | +-----------+ +-------+ +-------+ |
+ +----------+ +-------------------------------------+
+
+Software Architecture
+---------------------
+
+.. code-block:: none
+
+ +----------------------------------------------------------+
+ | Networking Subsystem |
+ +----------------------------------------------------------+
+ / \ / \
+ | |
+ | |
+ \ / |
+ +----------------------+ +-----------------------------+
+ | MAC Driver |<--->| OPEN Alliance TC6 Framework |
+ +----------------------+ +-----------------------------+
+ / \ / \
+ | |
+ | |
+ | \ /
+ +----------------------------------------------------------+
+ | SPI Subsystem |
+ +----------------------------------------------------------+
+ / \
+ |
+ |
+ \ /
+ +----------------------------------------------------------+
+ | 10BASE-T1x MAC-PHY Device |
+ +----------------------------------------------------------+
+
+Implementation
+--------------
+
+MAC Driver
+~~~~~~~~~~
+
+- Probed by SPI subsystem.
+
+- Initializes OA TC6 framework for the MAC-PHY.
+
+- Registers and configures the network device.
+
+- Sends the TX Ethernet frames from the networking subsystem to the OA TC6 framework.
+
+OPEN Alliance TC6 Framework
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- Initializes PHYLIB interface.
+
+- Registers the MAC-PHY interrupt.
+
+- Performs MAC-PHY register read/write operations using the control
+ transaction protocol specified in the OPEN Alliance 10BASE-T1x MAC-PHY
+ Serial Interface specification.
+
+- Performs Ethernet frame transactions using the data transaction protocol
+ for Ethernet frames specified in the OPEN Alliance 10BASE-T1x MAC-PHY
+ Serial Interface specification.
+
+- Forwards the received Ethernet frames from the 10BASE-T1x MAC-PHY to the
+ networking subsystem.
+
+Data Transaction
+~~~~~~~~~~~~~~~~
+
+The Ethernet frames that are typically transferred from the SPI host to
+the MAC-PHY will be converted into multiple transmit data chunks. Each
+transmit data chunk will have a 4-byte header which contains the
+information needed to determine the validity and the location of the
+transmit frame data within the 64-byte data chunk payload.
+
+.. code-block:: none
+
+ +---------------------------------------------------+
+ | Tx Chunk |
+ | +---------------------------+ +----------------+ | MOSI
+ | | 64 bytes chunk payload | | 4 bytes header | |------------>
+ | +---------------------------+ +----------------+ |
+ +---------------------------------------------------+
+
+The 4-byte header contains the following fields:
+
+DNC (Bit 31) - Data-Not-Control flag. This flag specifies the type of SPI
+ transaction. For TX data chunks, this bit shall be ‘1’.
+ 0 - Control command
+ 1 - Data chunk
+
+SEQ (Bit 30) - Data Chunk Sequence. This bit is used to indicate an
+ even/odd transmit data chunk sequence to the MAC-PHY.
+
+NORX (Bit 29) - No Receive flag. The SPI host may set this bit to prevent
+ the MAC-PHY from conveying RX data on the MISO for the
+ current chunk (DV = 0 in the footer), indicating that the
+ host would not process it. Typically, the SPI host should
+ set NORX = 0 indicating that it will accept and process
+ any receive frame data within the current chunk.
+
+RSVD (Bit 28..24) - Reserved: All reserved bits shall be ‘0’.
+
+VS (Bit 23..22) - Vendor Specific. These bits are implementation specific.
+ If the MAC-PHY does not implement these bits, the host
+ shall set them to ‘0’.
+
+DV (Bit 21) - Data Valid flag. The SPI host uses this bit to indicate
+ whether the current chunk contains valid transmit frame data
+ (DV = 1) or not (DV = 0). When ‘0’, the MAC-PHY ignores the
+ chunk payload. Note that the receive path is unaffected by
+ the setting of the DV bit in the data header.
+
+SV (Bit 20) - Start Valid flag. The SPI host shall set this bit when the
+ beginning of an Ethernet frame is present in the current
+ transmit data chunk payload. Otherwise, this bit shall be
+ zero. This bit is not to be confused with the Start-of-Frame
+ Delimiter (SFD) byte described in IEEE 802.3 [2].
+
+SWO (Bit 19..16) - Start Word Offset. When SV = 1, this field shall
+ contain the 32-bit word offset into the transmit data
+ chunk payload that points to the start of a new
+ Ethernet frame to be transmitted. The host shall write
+ this field as zero when SV = 0.
+
+RSVD (Bit 15) - Reserved: All reserved bits shall be ‘0’.
+
+EV (Bit 14) - End Valid flag. The SPI host shall set this bit when the end
+ of an Ethernet frame is present in the current transmit data
+ chunk payload. Otherwise, this bit shall be zero.
+
+EBO (Bit 13..8) - End Byte Offset. When EV = 1, this field shall contain
+ the byte offset into the transmit data chunk payload
+ that points to the last byte of the Ethernet frame to
+ transmit. This field shall be zero when EV = 0.
+
+TSC (Bit 7..6) - Timestamp Capture. Request a timestamp capture when the
+ frame is transmitted onto the network.
+ 00 - Do not capture a timestamp
+ 01 - Capture timestamp into timestamp capture register A
+ 10 - Capture timestamp into timestamp capture register B
+ 11 - Capture timestamp into timestamp capture register C
+
+RSVD (Bit 5..1) - Reserved: All reserved bits shall be ‘0’.
+
+P (Bit 0) - Parity. Parity bit calculated over the transmit data header.
+ Method used is odd parity.
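A minimal sketch of how a host might assemble the transmit data header from the bit positions listed above. The macro and function names are hypothetical. SEQ, NORX, VS and TSC are left at zero for brevity, and conversion to wire byte order is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the field list above; names are illustrative. */
#define TC6_HDR_DNC		(1u << 31)
#define TC6_HDR_DV		(1u << 21)
#define TC6_HDR_SV		(1u << 20)
#define TC6_HDR_SWO_SHIFT	16
#define TC6_HDR_EV		(1u << 14)
#define TC6_HDR_EBO_SHIFT	8

/* Return the value of bit 0 that gives the 32-bit word odd parity. */
static uint32_t tc6_odd_parity_bit(uint32_t word)
{
	uint32_t v = word & ~1u;	/* ignore the parity bit itself */

	v ^= v >> 16;
	v ^= v >> 8;
	v ^= v >> 4;
	v ^= v >> 2;
	v ^= v >> 1;
	/* v & 1 is 1 for an odd number of set bits; odd parity needs the inverse */
	return (v & 1u) ^ 1u;
}

/* Build a header for a chunk that carries valid transmit frame data. */
static uint32_t tc6_build_tx_header(bool start, unsigned int start_word,
				    bool end, unsigned int end_byte)
{
	uint32_t hdr = TC6_HDR_DNC | TC6_HDR_DV;

	assert(start_word <= 0xFu && end_byte <= 0x3Fu);

	if (start)	/* SWO must stay zero when SV = 0 */
		hdr |= TC6_HDR_SV | ((uint32_t)start_word << TC6_HDR_SWO_SHIFT);
	if (end)	/* EBO must stay zero when EV = 0 */
		hdr |= TC6_HDR_EV | ((uint32_t)end_byte << TC6_HDR_EBO_SHIFT);

	return hdr | tc6_odd_parity_bit(hdr);
}
```

For a 60-byte frame that fits in a single chunk, ``tc6_build_tx_header(true, 0, true, 59)`` sets SV and EV together with EBO = 59.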
+
+The number of buffers available in the MAC-PHY to store the incoming
+transmit data chunk payloads is represented as transmit credits. The
+available transmit credits in the MAC-PHY can be read either from the
+Buffer Status Register or from the footer (described below)
+received from the MAC-PHY. The SPI host should not write more data chunks
+than the available transmit credits as this will lead to a transmit buffer
+overflow error.
+
+If the previous data footer indicated no transmit credits available, the
+MAC-PHY interrupt is asserted to the SPI host once transmit credits
+become available for transmitting transmit data chunks. On reception of
+the first data header this interrupt will be deasserted, and the footer
+received for the first data chunk will carry the available transmit
+credits information.
+
+The Ethernet frames that are typically transferred from the MAC-PHY to the
+SPI host will be sent as multiple receive data chunks. Each receive data
+chunk will have 64 bytes of data chunk payload followed by a 4-byte footer
+which contains the information needed to determine the validity and the
+location of the receive frame data within the 64-byte data chunk payload.
+
+.. code-block:: none
+
+ +---------------------------------------------------+
+ | Rx Chunk |
+ | +----------------+ +---------------------------+ | MISO
+ | | 4 bytes footer | | 64 bytes chunk payload | |------------>
+ | +----------------+ +---------------------------+ |
+ +---------------------------------------------------+
+
+The 4-byte footer contains the following fields:
+
+EXST (Bit 31) - Extended Status. This bit is set when any bit in the
+ STATUS0 or STATUS1 registers is set and not masked.
+
+HDRB (Bit 30) - Received Header Bad. When set, indicates that the MAC-PHY
+ received a control or data header with a parity error.
+
+SYNC (Bit 29) - Configuration Synchronized flag. This bit reflects the
+ state of the SYNC bit in the CONFIG0 configuration
+ register (see Table 12). A zero indicates that the MAC-PHY
+ configuration may not be as expected by the SPI host.
+ Following configuration, the SPI host sets the
+ corresponding bit in the configuration register which is
+ reflected in this field.
+
+RCA (Bit 28..24) - Receive Chunks Available. The RCA field indicates to
+ the SPI host the minimum number of additional receive
+ data chunks of frame data that are available for
+ reading beyond the current receive data chunk. This
+ field is zero when there is no receive frame data
+ pending in the MAC-PHY’s buffer for reading.
+
+VS (Bit 23..22) - Vendor Specific. These bits are implementation specific.
+ If not implemented, the MAC-PHY shall set these bits to
+ ‘0’.
+
+DV (Bit 21) - Data Valid flag. The MAC-PHY uses this bit to indicate
+ whether the current receive data chunk contains valid
+ receive frame data (DV = 1) or not (DV = 0). When ‘0’, the
+ SPI host shall ignore the chunk payload.
+
+SV (Bit 20) - Start Valid flag. The MAC-PHY sets this bit when the current
+ chunk payload contains the start of an Ethernet frame.
+ Otherwise, this bit is zero. The SV bit is not to be
+ confused with the Start-of-Frame Delimiter (SFD) byte
+ described in IEEE 802.3 [2].
+
+SWO (Bit 19..16) - Start Word Offset. When SV = 1, this field contains the
+ 32-bit word offset into the receive data chunk payload
+ containing the first byte of a new received Ethernet
+ frame. When a receive timestamp has been added to the
+ beginning of the received Ethernet frame (RTSA = 1)
+ then SWO points to the most significant byte of the
+ timestamp. This field will be zero when SV = 0.
+
+FD (Bit 15) - Frame Drop. When set, this bit indicates that the MAC has
+ detected a condition for which the SPI host should drop the
+ received Ethernet frame. This bit is only valid at the end
+ of a received Ethernet frame (EV = 1) and shall be zero at
+ all other times.
+
+EV (Bit 14) - End Valid flag. The MAC-PHY sets this bit when the end of a
+ received Ethernet frame is present in this receive data
+ chunk payload.
+
+EBO (Bit 13..8) - End Byte Offset: When EV = 1, this field contains the
+ byte offset into the receive data chunk payload that
+ locates the last byte of the received Ethernet frame.
+ This field is zero when EV = 0.
+
+RTSA (Bit 7) - Receive Timestamp Added. This bit is set when a 32-bit or
+ 64-bit timestamp has been added to the beginning of the
+ received Ethernet frame. The MAC-PHY shall set this bit to
+ zero when SV = 0.
+
+RTSP (Bit 6) - Receive Timestamp Parity. Parity bit calculated over the
+ 32-bit/64-bit timestamp added to the beginning of the
+ received Ethernet frame. Method used is odd parity. The
+ MAC-PHY shall set this bit to zero when RTSA = 0.
+
+TXC (Bit 5..1) - Transmit Credits. This field contains the minimum number
+ of transmit data chunks of frame data that the SPI host
+ can write in a single transaction without incurring a
+ transmit buffer overflow error.
+
+P (Bit 0) - Parity. Parity bit calculated over the receive data footer.
+ Method used is odd parity.
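The footer layout above can be decoded with plain shifts and masks. The following is a sketch only: ``struct tc6_rx_footer`` and the helper names are hypothetical, the footer is assumed to be already converted to host byte order, and the vendor-specific bits are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of the 32-bit receive data footer. */
struct tc6_rx_footer {
	bool exst, hdrb, sync;
	uint8_t rca;		/* bits 28..24 */
	bool dv, sv;
	uint8_t swo;		/* bits 19..16 */
	bool fd, ev;
	uint8_t ebo;		/* bits 13..8 */
	bool rtsa, rtsp;
	uint8_t txc;		/* bits 5..1 */
	bool parity_ok;		/* true when the whole word has odd parity */
};

/* Extract a field of the given width starting at bit position shift. */
static uint32_t tc6_field(uint32_t word, unsigned int shift, unsigned int width)
{
	assert(width > 0 && width < 32 && shift + width <= 32);
	return (word >> shift) & ((1u << width) - 1u);
}

/* True when the number of set bits in the word, parity bit included, is odd. */
static bool tc6_has_odd_parity(uint32_t word)
{
	word ^= word >> 16;
	word ^= word >> 8;
	word ^= word >> 4;
	word ^= word >> 2;
	word ^= word >> 1;
	return word & 1u;
}

static struct tc6_rx_footer tc6_parse_rx_footer(uint32_t footer)
{
	struct tc6_rx_footer f = {
		.exst = tc6_field(footer, 31, 1),
		.hdrb = tc6_field(footer, 30, 1),
		.sync = tc6_field(footer, 29, 1),
		.rca  = (uint8_t)tc6_field(footer, 24, 5),
		.dv   = tc6_field(footer, 21, 1),
		.sv   = tc6_field(footer, 20, 1),
		.swo  = (uint8_t)tc6_field(footer, 16, 4),
		.fd   = tc6_field(footer, 15, 1),
		.ev   = tc6_field(footer, 14, 1),
		.ebo  = (uint8_t)tc6_field(footer, 8, 6),
		.rtsa = tc6_field(footer, 7, 1),
		.rtsp = tc6_field(footer, 6, 1),
		.txc  = (uint8_t)tc6_field(footer, 1, 5),
		.parity_ok = tc6_has_odd_parity(footer),
	};

	return f;
}
```

A caller would normally check ``parity_ok`` first and only then trust the remaining fields.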
+
+The SPI host will initiate the data receive transaction based on the receive
+chunks available in the MAC-PHY, which is provided in the receive chunk
+footer (RCA - Receive Chunks Available). The SPI host will create data-invalid
+transmit data chunks (empty chunks), or data-valid transmit data chunks in
+case there are valid Ethernet frames to transmit to the MAC-PHY. The
+receive chunks available in the MAC-PHY can be read either from the Buffer
+Status Register or from the footer.
+
+If the previous data footer indicated no receive data chunks available, the
+MAC-PHY interrupt is asserted to the SPI host once receive data chunks
+become available again for reading. On reception of the first data
+header this interrupt will be deasserted, and the received footer for the
+first data chunk will carry the receive chunks available information.
+
+MAC-PHY Interrupt
+~~~~~~~~~~~~~~~~~
+
+The MAC-PHY interrupt is asserted when any of the following conditions is met.
+
+Receive chunks available - This interrupt is asserted when the previous
+data footer had no receive data chunks available and once the receive
+data chunks become available for reading. On reception of the first data
+header this interrupt will be deasserted.
+
+Transmit chunk credits available - This interrupt is asserted when the
+previous data footer indicated no transmit credits available and once the
+transmit credits become available for transmitting transmit data chunks.
+On reception of the first data header this interrupt will be deasserted.
+
+Extended status event - This interrupt is asserted when the previous data
+footer indicated no extended status and once an extended status event
+becomes available. In this case the host should read the status #0 register
+to identify the corresponding error/event. On reception of the first data
+header this interrupt will be deasserted.
+
+Control Transaction
+~~~~~~~~~~~~~~~~~~~
+
+The 4-byte control header contains the following fields:
+
+DNC (Bit 31) - Data-Not-Control flag. This flag specifies the type of SPI
+ transaction. For control commands, this bit shall be ‘0’.
+ 0 - Control command
+ 1 - Data chunk
+
+HDRB (Bit 30) - Received Header Bad. When set by the MAC-PHY, indicates
+ that a header was received with a parity error. The SPI
+ host should always clear this bit. The MAC-PHY ignores the
+ HDRB value sent by the SPI host on MOSI.
+
+WNR (Bit 29) - Write-Not-Read. This bit indicates if data is to be written
+ to registers (when set) or read from registers
+ (when clear).
+
+AID (Bit 28) - Address Increment Disable. When clear, the address will be
+ automatically post-incremented by one following each
+ register read or write. When set, address auto increment is
+ disabled allowing successive reads and writes to occur at
+ the same register address.
+
+MMS (Bit 27..24) - Memory Map Selector. This field selects the specific
+ register memory map to access.
+
+ADDR (Bit 23..8) - Address. Address of the first register within the
+ selected memory map to access.
+
+LEN (Bit 7..1) - Length. Specifies the number of registers to read/write.
+ This field is interpreted as the number of registers
+ minus 1 allowing for up to 128 consecutive registers read
+ or written starting at the address specified in ADDR. A
+ length of zero shall read or write a single register.
+
+P (Bit 0) - Parity. Parity bit calculated over the control command header.
+ Method used is odd parity.
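+
+As an illustration of the layout above, the sketch below assembles a control
+command header and fills in the odd parity bit. It is not the kernel
+implementation: ``ctrl_header`` and its parameters are hypothetical names,
+the caller is assumed to pass a register count between 1 and 128, and the
+result is in host byte order.
+

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Build a 32-bit control command header (host byte order).
 * Hypothetical helper: assumes 1 <= num_regs <= 128.
 */
static uint32_t ctrl_header(bool write, bool addr_inc_disable, uint8_t mms,
			    uint16_t addr, unsigned int num_regs)
{
	uint32_t hdr = 0; /* DNC (bit 31) = 0: control, HDRB (bit 30) = 0 */
	unsigned int ones = 0;

	hdr |= (uint32_t)write << 29;                   /* WNR */
	hdr |= (uint32_t)addr_inc_disable << 28;        /* AID */
	hdr |= (uint32_t)(mms & 0xFu) << 24;            /* MMS */
	hdr |= (uint32_t)addr << 8;                     /* ADDR */
	hdr |= (uint32_t)((num_regs - 1) & 0x7Fu) << 1; /* LEN = count - 1 */

	/* Odd parity: set P so that the header has an odd number of 1 bits. */
	for (uint32_t v = hdr; v; v >>= 1)
		ones += v & 1u;
	if ((ones & 1u) == 0)
		hdr |= 1u;

	return hdr;
}
```

+
+For example, a single-register write to address 0x0004 in memory map 0 sets
+WNR and one ADDR bit, which is an even number of ones, so the parity bit is
+set as well.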
+
+Control transactions consist of one or more control commands. Control
+commands are used by the SPI host to read and write registers within the
+MAC-PHY. Each control command is composed of a 4-byte control command
+header, followed by register write data in the case of a control write
+command.
+
+The MAC-PHY ignores the final 4 bytes of data from the SPI host at the end
+of the control write command. The control write command is also echoed
+from the MAC-PHY back to the SPI host so that the host can identify which
+register write failed in case of any bus errors. The echoed control write
+command starts with 4 bytes of unused data, to be ignored by the SPI host,
+followed by the 4-byte echoed control header and the echoed register
+write data. Control write commands can write either a single register or
+multiple consecutive registers. When multiple consecutive registers are
+written, the address is automatically post-incremented by the MAC-PHY.
+Writes to unimplemented or undefined registers shall be ignored and
+have no effect.
+
+The MAC-PHY ignores all data from the SPI host following the control
+header for the remainder of the control read command. The control read
+command is also echoed from the MAC-PHY back to the SPI host so that the
+host can identify which register read failed in case of any bus errors.
+The echoed control read command starts with 4 bytes of unused data, to be
+ignored by the SPI host, followed by the 4-byte echoed control header and
+the register read data. Control read commands can read either a single
+register or multiple consecutive registers. When multiple consecutive
+registers are read, the address is automatically post-incremented by the
+MAC-PHY. Reading any unimplemented or undefined registers shall return
+zero.
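+
+The framing described above implies a fixed relationship between the number
+of registers and the size of the SPI transfer. The sketch below works this
+out under two assumptions that are not stated in this section: each register
+is 4 bytes wide, and no additional protection bytes are exchanged. All names
+are hypothetical.
+

```c
#include <stddef.h>

#define CTRL_HEADER_SIZE  4u /* control command header */
#define CTRL_REG_SIZE     4u /* assumed register width in bytes */
#define CTRL_IGNORED_SIZE 4u /* unused bytes, ignored by the receiver */

/* Total bytes clocked in each direction for a control command. */
static size_t ctrl_transfer_size(unsigned int num_regs)
{
	return CTRL_HEADER_SIZE + (size_t)num_regs * CTRL_REG_SIZE +
	       CTRL_IGNORED_SIZE;
}

/* Offset of the echoed control header in the data received by the host. */
static size_t ctrl_rx_header_offset(void)
{
	return CTRL_IGNORED_SIZE;
}

/* Offset of the echoed write data or the read data in the received data. */
static size_t ctrl_rx_data_offset(void)
{
	return CTRL_IGNORED_SIZE + CTRL_HEADER_SIZE;
}
```

+
+Under these assumptions a single-register command occupies 12 bytes on the
+bus, and the register data seen by the host starts at byte offset 8.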
+
+Device drivers API
+==================
+
+The include/linux/oa_tc6.h header defines the following functions:
+
+.. c:function:: struct oa_tc6 *oa_tc6_init(struct spi_device *spi, \
+ struct net_device *netdev)
+
+Initialize OA TC6 lib.
+
+.. c:function:: void oa_tc6_exit(struct oa_tc6 *tc6)
+
+Free allocated OA TC6 lib.
+
+.. c:function:: int oa_tc6_write_register(struct oa_tc6 *tc6, u32 address, \
+ u32 value)
+
+Write a single register in the MAC-PHY.
+
+.. c:function:: int oa_tc6_write_registers(struct oa_tc6 *tc6, u32 address, \
+ u32 value[], u8 length)
+
+Write multiple consecutive registers starting from @address in the MAC-PHY.
+A maximum of 128 consecutive registers can be written starting at @address.
+
+.. c:function:: int oa_tc6_read_register(struct oa_tc6 *tc6, u32 address, \
+ u32 *value)
+
+Read a single register in the MAC-PHY.
+
+.. c:function:: int oa_tc6_read_registers(struct oa_tc6 *tc6, u32 address, \
+ u32 value[], u8 length)
+
+Read multiple consecutive registers starting from @address in the MAC-PHY.
+A maximum of 128 consecutive registers can be read starting at @address.
+
+.. c:function:: netdev_tx_t oa_tc6_start_xmit(struct oa_tc6 *tc6, \
+ struct sk_buff *skb);
+
+Hand the Ethernet frame in the skb to the OA TC6 lib for transmission through
+the MAC-PHY.
+
+.. c:function:: int oa_tc6_zero_align_receive_frame_enable(struct oa_tc6 *tc6);
+
+Enable the zero align receive frame feature, which aligns every received
+Ethernet frame to start at the beginning of a receive data chunk payload,
+with a start word offset (SWO) of zero.
diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst
index dca15d15feaf..02370786e77b 100644
--- a/Documentation/networking/packet_mmap.rst
+++ b/Documentation/networking/packet_mmap.rst
@@ -16,7 +16,7 @@ ii) transmit network traffic, or any other that needs raw
Howto can be found at:
- https://sites.google.com/site/packetmmap/
+ https://web.archive.org/web/20220404160947/https://sites.google.com/site/packetmmap/
Please send your comments to
- Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
@@ -166,7 +166,8 @@ As capture, each frame contains two parts::
/* bind socket to eth0 */
bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
- A complete tutorial is available at: https://sites.google.com/site/packetmmap/
+ A complete tutorial is available at:
+ https://web.archive.org/web/20220404160947/https://sites.google.com/site/packetmmap/
By default, the user should put data at::
diff --git a/Documentation/networking/phy-link-topology.rst b/Documentation/networking/phy-link-topology.rst
new file mode 100644
index 000000000000..4dec5d7d6513
--- /dev/null
+++ b/Documentation/networking/phy-link-topology.rst
@@ -0,0 +1,121 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _phy_link_topology:
+
+=================
+PHY link topology
+=================
+
+Overview
+========
+
+The PHY link topology representation in the networking stack aims at representing
+the hardware layout for any given Ethernet link.
+
+An Ethernet interface from userspace's point of view is nothing but a
+:c:type:`struct net_device <net_device>`, which exposes configuration options
+through the legacy ioctls and the ethtool netlink commands. The base assumption
+when designing these configuration APIs was that the link looks something like ::
+
+ +-----------------------+ +----------+ +--------------+
+ | Ethernet Controller / | | Ethernet | | Connector / |
+ | MAC | ------ | PHY | ---- | Port | ---... to LP
+ +-----------------------+ +----------+ +--------------+
+ struct net_device struct phy_device
+
+Commands that need to configure the PHY will go through the net_device.phydev
+field to reach the PHY and perform the relevant configuration.
+
+This assumption falls apart in more complex topologies that can arise when,
+for example, using SFP transceivers (although that's not the only specific case).
+
+Here, we have 2 basic scenarios. In the first, the MAC is able to output a
+serialized interface, such as SGMII, 1000BaseX or 10GBaseR, that can be fed
+directly to an SFP cage.
+
+The link topology then looks like this (when an SFP module is inserted) ::
+
+ +-----+ SGMII +------------+
+ | MAC | ------- | SFP Module |
+ +-----+ +------------+
+
+Knowing that some modules embed a PHY, the actual link is more like ::
+
+ +-----+ SGMII +--------------+
+ | MAC | -------- | PHY (on SFP) |
+ +-----+ +--------------+
+
+In this case, the SFP PHY is handled by phylib, and registered by phylink through
+its SFP upstream ops.
+
+Now some Ethernet controllers aren't able to output a serialized interface, so
+we can't directly connect them to an SFP cage. However, some PHYs can be used
+as media-converters, to translate the non-serialized MAC MII interface to a
+serialized MII interface fed to the SFP ::
+
+ +-----+ RGMII +-----------------------+ SGMII +--------------+
+ | MAC | ------- | PHY (media converter) | ------- | PHY (on SFP) |
+ +-----+ +-----------------------+ +--------------+
+
+This is where the model of having a single net_device.phydev pointer shows its
+limitations, as we now have 2 PHYs on the link.
+
+The phy_link topology framework aims at providing a way to keep track of every
+PHY on the link, for use by both kernel drivers and subsystems, but also to
+report the topology to userspace, so that configuration commands can target
+individual PHYs.
+
+API
+===
+
+The :c:type:`struct phy_link_topology <phy_link_topology>` is a per-netdevice
+resource, that gets initialized at netdevice creation. Once it's initialized,
+it is then possible to register PHYs to the topology through:
+
+:c:func:`phy_link_topo_add_phy`
+
+Besides registering the PHY to the topology, this call will also assign a unique
+index to the PHY, which can then be reported to userspace to refer to this PHY
+(akin to the ifindex). This index is a u32, ranging from 1 to U32_MAX. The value
+0 is reserved to indicate the PHY doesn't belong to any topology yet.
+
+The PHY can then be removed from the topology through
+
+:c:func:`phy_link_topo_del_phy`
+
+These functions are already hooked into the phylib subsystem, so all PHYs that
+are linked to a net_device through :c:func:`phy_attach_direct` will automatically
+join the netdev's topology.
+
+PHYs that are on an SFP module will also be automatically registered if the SFP
+upstream is phylink (so, no media-converter).
+
+PHY drivers that can be used as SFP upstream need to call :c:func:`phy_sfp_attach_phy`
+and :c:func:`phy_sfp_detach_phy`, which can be used as a
+.attach_phy / .detach_phy implementation for the
+:c:type:`struct sfp_upstream_ops <sfp_upstream_ops>`.
+
+UAPI
+====
+
+There exists a set of netlink commands to query the link topology from userspace,
+see ``Documentation/networking/ethtool-netlink.rst``.
+
+The whole point of having a topology representation is to assign the phyindex
+field in :c:type:`struct phy_device <phy_device>`. This index is reported to
+userspace using the ``ETHTOOL_MSG_PHY_GET`` ethnl command. Performing a DUMP operation
+will result in all PHYs from all net_devices being listed. The DUMP command
+accepts either an ``ETHTOOL_A_HEADER_DEV_INDEX`` or an ``ETHTOOL_A_HEADER_DEV_NAME``
+attribute in the request to filter the DUMP to a single net_device.
+
+The retrieved index can then be passed as a request parameter using the
+``ETHTOOL_A_HEADER_PHY_INDEX`` field in the following ethnl commands:
+
+* ``ETHTOOL_MSG_STRSET_GET`` to get the stats string set from a given PHY
+* ``ETHTOOL_MSG_CABLE_TEST_ACT`` and ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT``, to perform
+ cable testing on a given PHY on the link (most likely the outermost PHY)
+* ``ETHTOOL_MSG_PSE_SET`` and ``ETHTOOL_MSG_PSE_GET`` for PHY-controlled PoE and PSE settings
+* ``ETHTOOL_MSG_PLCA_GET_CFG``, ``ETHTOOL_MSG_PLCA_SET_CFG`` and ``ETHTOOL_MSG_PLCA_GET_STATUS``
+ to get and set the PLCA (Physical Layer Collision Avoidance) parameters
+
+Note that the PHY index can be passed to other requests, which will silently
+ignore it if present and irrelevant.
diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index 1283240d7620..f64641417c54 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -327,6 +327,12 @@ Some of the interface modes are described below:
This is the Penta SGMII mode, it is similar to QSGMII but it combines 5
SGMII lines into a single link compared to 4 on QSGMII.
+``PHY_INTERFACE_MODE_10G_QXGMII``
+ Represents the 10G-QXGMII PHY-MAC interface as defined by the Cisco USXGMII
+ Multiport Copper Interface document. It supports 4 ports over a 10.3125 GHz
+ SerDes lane, each port having speeds of 2.5G / 1G / 100M / 10M achieved
+ through symbol replication. The PCS expects the standard USXGMII code word.
+
Pause frames / flow control
===========================
diff --git a/Documentation/networking/pse-pd/index.rst b/Documentation/networking/pse-pd/index.rst
new file mode 100644
index 000000000000..de28a5aee316
--- /dev/null
+++ b/Documentation/networking/pse-pd/index.rst
@@ -0,0 +1,10 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Power Sourcing Equipment (PSE) Documentation
+============================================
+
+.. toctree::
+ :maxdepth: 2
+
+ introduction
+ pse-pi
diff --git a/Documentation/networking/pse-pd/introduction.rst b/Documentation/networking/pse-pd/introduction.rst
new file mode 100644
index 000000000000..e3d3faaef717
--- /dev/null
+++ b/Documentation/networking/pse-pd/introduction.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Power Sourcing Equipment (PSE) in IEEE 802.3 Standard
+=====================================================
+
+Overview
+--------
+
+Power Sourcing Equipment (PSE) is essential in networks for delivering power
+along with data over Ethernet cables. It usually refers to devices like
+switches and hubs that supply power to Powered Devices (PDs) such as IP
+cameras, VoIP phones, and wireless access points.
+
+PSE vs. PoDL PSE
+----------------
+
+PSE in the IEEE 802.3 standard generally refers to equipment that provides
+power alongside data over Ethernet cables, typically associated with Power over
+Ethernet (PoE).
+
+PoDL PSE, or Power over Data Lines PSE, specifically denotes PSEs operating
+with single balanced twisted-pair PHYs, as per Clause 104 of IEEE 802.3. PoDL
+is significant in contexts like automotive and industrial controls where power
+and data delivery over a single pair is advantageous.
+
+IEEE 802.3-2018 Addendums and Related Clauses
+---------------------------------------------
+
+Key addenda to the IEEE 802.3-2018 standard relevant to power delivery over
+Ethernet are as follows:
+
+- **802.3af (Approved on 2003-06-12)**: Known as PoE in the market, detailed in
+ Clause 33, delivering up to 15.4W of power.
+- **802.3at (Approved on 2009-09-11)**: Marketed as PoE+, enhancing PoE as
+ covered in Clause 33, increasing power delivery to up to 30W.
+- **802.3bt (Approved on 2018-09-27)**: Known as 4PPoE in the market, outlined
+ in Clause 33. Type 3 delivers up to 60W, and Type 4 up to 100W.
+- **802.3bu (Approved on 2016-12-07)**: Formerly referred to as PoDL, detailed
+ in Clause 104. Introduces Classes 0 - 9. Class 9 PoDL PSE delivers up to ~65W.
+
+Kernel Naming Convention Recommendations
+----------------------------------------
+
+For clarity and consistency within the Linux kernel's networking subsystem, the
+following naming conventions are recommended:
+
+- For general PSE (PoE) code, use "c33_pse" keywords. For example:
+ ``enum ethtool_c33_pse_admin_state c33_admin_control;``.
+ This aligns with Clause 33, encompassing various PoE forms.
+
+- For PoDL PSE-specific code, use "podl_pse". For example:
+ ``enum ethtool_podl_pse_admin_state podl_admin_control;`` to differentiate
+ PoDL PSE settings according to Clause 104.
+
+Summary of Clause 33: Data Terminal Equipment (DTE) Power via Media Dependent Interface (MDI)
+---------------------------------------------------------------------------------------------
+
+Clause 33 of the IEEE 802.3 standard defines the functional and electrical
+characteristics of Powered Device (PD) and Power Sourcing Equipment (PSE).
+These entities enable power delivery using the same generic cabling as for data
+transmission, integrating power with data communication for devices such as
+10BASE-T, 100BASE-TX, or 1000BASE-T.
+
+Summary of Clause 104: Power over Data Lines (PoDL) of Single Balanced Twisted-Pair Ethernet
+--------------------------------------------------------------------------------------------
+
+Clause 104 of the IEEE 802.3 standard delineates the functional and electrical
+characteristics of PoDL Powered Devices (PDs) and PoDL Power Sourcing Equipment
+(PSEs). These are designed for use with single balanced twisted-pair Ethernet
+Physical Layers. In this clause, 'PSE' refers specifically to PoDL PSE, and
+'PD' to PoDL PD. The key intent is to provide devices with a unified interface
+for both data and the power required to process this data over a single
+balanced twisted-pair Ethernet connection.
diff --git a/Documentation/networking/pse-pd/pse-pi.rst b/Documentation/networking/pse-pd/pse-pi.rst
new file mode 100644
index 000000000000..5cad14fedc13
--- /dev/null
+++ b/Documentation/networking/pse-pd/pse-pi.rst
@@ -0,0 +1,301 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+PSE Power Interface (PSE PI) Documentation
+==========================================
+
+The Power Sourcing Equipment Power Interface (PSE PI) plays a pivotal role in
+the architecture of Power over Ethernet (PoE) systems. It is essentially a
+blueprint that outlines how one or multiple power sources are connected to the
+eight-pin modular jack, commonly known as the Ethernet RJ45 port. This
+connection scheme is crucial for enabling the delivery of power alongside data
+over Ethernet cables.
+
+Documentation and Standards
+---------------------------
+
+The IEEE 802.3 standard provides detailed documentation on the PSE PI.
+Specifically:
+
+- Section "33.2.3 PI pin assignments" covers the pin assignments for PoE
+ systems that utilize two pairs for power delivery.
+- Section "145.2.4 PSE PI" addresses the configuration for PoE systems that
+ deliver power over all four pairs of an Ethernet cable.
+
+PSE PI and Single Pair Ethernet
+-------------------------------
+
+Single Pair Ethernet (SPE) represents a different approach to Ethernet
+connectivity, utilizing just one pair of conductors for both data and power
+transmission. Unlike the configurations detailed in the PSE PI for standard
+Ethernet, which can involve multiple power sourcing arrangements across four or
+two pairs of wires, SPE operates on a simpler model due to its single-pair
+design. As a result, the complexities of choosing between alternative pin
+assignments for power delivery, as described in the PSE PI for multi-pair
+Ethernet, are not applicable to SPE.
+
+Understanding PSE PI
+--------------------
+
+The Power Sourcing Equipment Power Interface (PSE PI) is a framework defining
+how Power Sourcing Equipment (PSE) delivers power to Powered Devices (PDs) over
+Ethernet cables. It details two main configurations for power delivery, known
+as Alternative A and Alternative B, which are distinguished not only by their
+method of power transmission but also by the implications for polarity and data
+transmission direction.
+
+Alternative A and B Overview
+----------------------------
+
+- **Alternative A:** Utilizes RJ45 conductors 1, 2, 3 and 6. On both
+ 10/100BaseT and 1G/2.5G/5G/10GBaseT networks, these pairs carry data.
+ The power delivery's polarity in this alternative can vary based on the MDI
+ (Medium Dependent Interface) or MDI-X (Medium Dependent Interface Crossover)
+ configuration.
+
+- **Alternative B:** Utilizes RJ45 conductors 4, 5, 7 and 8. On 10/100BaseT
+ networks, these are spare pairs that carry no data and are less influenced
+ by data transmission direction. This is not the case on 1G/2.5G/5G/10GBaseT
+ networks, where all four pairs carry data. Alternative B includes two
+ configurations with different polarities, known as variant X and variant S,
+ to accommodate different network requirements and device specifications.
+
+Table 145-3 PSE Pinout Alternatives
+-----------------------------------
+
+The following table outlines the pin configurations for both Alternative A and
+Alternative B.
+
++------------+-------------------+-----------------+-----------------+-----------------+
+| Conductor | Alternative A | Alternative A | Alternative B | Alternative B |
+| | (MDI-X) | (MDI) | (X) | (S) |
++============+===================+=================+=================+=================+
+| 1 | Negative V | Positive V | - | - |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 2 | Negative V | Positive V | - | - |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 3 | Positive V | Negative V | - | - |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 4 | - | - | Negative V | Positive V |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 5 | - | - | Negative V | Positive V |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 6 | Positive V | Negative V | - | - |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 7 | - | - | Positive V | Negative V |
++------------+-------------------+-----------------+-----------------+-----------------+
+| 8 | - | - | Positive V | Negative V |
++------------+-------------------+-----------------+-----------------+-----------------+
+
+.. note::
+ - "Positive V" and "Negative V" indicate the voltage polarity for each pin.
+ - "-" indicates that the pin is not used for power delivery in that
+ specific configuration.
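+
+The pinout table above maps naturally onto a small lookup table. The sketch
+below is only an illustration of that mapping: the enum and function names
+are hypothetical and are not part of any kernel API.
+

```c
/* Polarity of one conductor; PSE_UNUSED corresponds to "-" in the table. */
enum pse_polarity {
	PSE_NEGATIVE = -1,
	PSE_UNUSED = 0,
	PSE_POSITIVE = 1,
};

/* One entry per column of the pinout table. */
enum pse_pinout {
	PSE_ALT_A_MDIX,
	PSE_ALT_A_MDI,
	PSE_ALT_B_X,
	PSE_ALT_B_S,
	PSE_PINOUT_COUNT,
};

/* Rows are conductors 1..8, columns follow enum pse_pinout. */
static const enum pse_polarity pse_pinout_table[8][PSE_PINOUT_COUNT] = {
	{ PSE_NEGATIVE, PSE_POSITIVE, PSE_UNUSED,   PSE_UNUSED   }, /* 1 */
	{ PSE_NEGATIVE, PSE_POSITIVE, PSE_UNUSED,   PSE_UNUSED   }, /* 2 */
	{ PSE_POSITIVE, PSE_NEGATIVE, PSE_UNUSED,   PSE_UNUSED   }, /* 3 */
	{ PSE_UNUSED,   PSE_UNUSED,   PSE_NEGATIVE, PSE_POSITIVE }, /* 4 */
	{ PSE_UNUSED,   PSE_UNUSED,   PSE_NEGATIVE, PSE_POSITIVE }, /* 5 */
	{ PSE_POSITIVE, PSE_NEGATIVE, PSE_UNUSED,   PSE_UNUSED   }, /* 6 */
	{ PSE_UNUSED,   PSE_UNUSED,   PSE_POSITIVE, PSE_NEGATIVE }, /* 7 */
	{ PSE_UNUSED,   PSE_UNUSED,   PSE_POSITIVE, PSE_NEGATIVE }, /* 8 */
};

/* Returns PSE_UNUSED for out-of-range conductors or pinouts. */
static enum pse_polarity pse_pin_polarity(enum pse_pinout pinout,
					  unsigned int conductor)
{
	if (conductor < 1 || conductor > 8 ||
	    (unsigned int)pinout >= PSE_PINOUT_COUNT)
		return PSE_UNUSED;

	return pse_pinout_table[conductor - 1][pinout];
}
```

+
+Keeping the data in the same row and column order as the table makes it easy
+to check the code against the standard by eye.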
+
+PSE PI compatibilities
+----------------------
+
+The following table outlines the compatibility between the pinout alternatives
+and 1000/2.5G/5G/10GBaseT in a 2-pair PSE connection.
+
++---------+---------------+---------------------+-----------------------+
+| Variant | Alternative | Power Feeding Type | Compatibility with |
+| | (A/B) | (Direct/Phantom) | 1000/2.5G/5G/10GBaseT |
++=========+===============+=====================+=======================+
+| 1 | A | Phantom | Yes |
++---------+---------------+---------------------+-----------------------+
+| 2 | B | Phantom | Yes |
++---------+---------------+---------------------+-----------------------+
+| 3 | B | Direct | No |
++---------+---------------+---------------------+-----------------------+
+
+.. note::
+ - "Direct" indicates a variant where, in the case of spare pairs, the power
+ is injected directly into the pairs without using magnetics.
+ - "Phantom" indicates a power path over coils/magnetics, as is done for
+ the Alternative A variant.
+
+For a 4-pair PSE, a PSE supporting only 10/100BaseT (which means Direct
+Power on pinout Alternative B) is not compatible with a 4-pair
+1000/2.5G/5G/10GBaseT link.
+
+PSE Power Interface (PSE PI) Connection Diagram
+-----------------------------------------------
+
+The diagram below illustrates the connection architecture between the RJ45
+port, the Ethernet PHY (Physical Layer), and the PSE PI (Power Sourcing
+Equipment Power Interface), demonstrating how power and data are delivered
+simultaneously through an Ethernet cable. The RJ45 port serves as the physical
+interface for these connections, with each of its eight pins connected to both
+the Ethernet PHY for data transmission and the PSE PI for power delivery.
+
+.. code-block::
+
+ +--------------------------+
+ | |
+ | RJ45 Port |
+ | |
+ +--+--+--+--+--+--+--+--+--+ +-------------+
+ 1| 2| 3| 4| 5| 6| 7| 8| | |
+ | | | | | | | o-------------------+ |
+ | | | | | | o--|-------------------+ +<--- PSE 1
+ | | | | | o--|--|-------------------+ |
+ | | | | o--|--|--|-------------------+ |
+ | | | o--|--|--|--|-------------------+ PSE PI |
+ | | o--|--|--|--|--|-------------------+ |
+ | o--|--|--|--|--|--|-------------------+ +<--- PSE 2 (optional)
+ o--|--|--|--|--|--|--|-------------------+ |
+ | | | | | | | | | |
+ +--+--+--+--+--+--+--+--+--+ +-------------+
+ | |
+ | Ethernet PHY |
+ | |
+ +--------------------------+
+
+Simple PSE PI Configuration for Alternative A
+---------------------------------------------
+
+The diagram below illustrates a straightforward PSE PI (Power Sourcing
+Equipment Power Interface) configuration designed to support the Alternative A
+setup for Power over Ethernet (PoE). This implementation is tailored to provide
+power delivery through the data-carrying pairs of an Ethernet cable, suitable
+for either MDI or MDI-X configurations, albeit supporting one variation at a
+time.
+
+.. code-block::
+
+ +-------------+
+ | PSE PI |
+ 8 -----+ +-------------+
+ 7 -----+ Rail 1 |
+ 6 -----+------+----------------------+
+ 5 -----+ | |
+ 4 -----+ | Rail 2 | PSE 1
+ 3 -----+------/ +------------+
+ 2 -----+--+-------------/ |
+ 1 -----+--/ +-------------+
+ |
+ +-------------+
+
+In this configuration:
+
+- Pins 1 and 2, as well as pins 3 and 6, are utilized for power delivery in
+ addition to data transmission. This aligns with the standard wiring for
+ 10/100BaseT Ethernet networks where these pairs are used for data.
+- Rail 1 and Rail 2 represent the positive and negative voltage rails, with
+ Rail 1 connected to pins 3 and 6, and Rail 2 connected to pins 1 and 2.
+ More advanced PSE PI configurations may include integrated or external
+ switches to change the polarity of the voltage rails, allowing for
+ compatibility with both MDI and MDI-X configurations.
+
+More complex PSE PI configurations may include additional components, to support
+Alternative B, or to provide additional features such as power management, or
+additional power delivery capabilities such as 2-pair or 4-pair power delivery.
+
+.. code-block::
+
+ +-------------+
+ | PSE PI |
+ | +---+
+ 8 -----+--------+ | +-------------+
+ 7 -----+--------+ | Rail 1 |
+ 6 -----+--------+ +-----------------+
+ 5 -----+--------+ | |
+ 4 -----+--------+ | Rail 2 | PSE 1
+ 3 -----+--------+ +----------------+
+ 2 -----+--------+ | |
+ 1 -----+--------+ | +-------------+
+ | +---+
+ +-------------+
+
+Device Tree Configuration: Describing PSE PI Configurations
+-----------------------------------------------------------
+
+The necessity for a separate PSE PI node in the device tree is influenced by
+the intricacy of the Power over Ethernet (PoE) system's setup. Here are
+descriptions of both simple and complex PSE PI configurations to illustrate
+this decision-making process:
+
+**Simple PSE PI Configuration:**
+In a straightforward scenario, the PSE PI setup involves a direct, one-to-one
+connection between a single PSE controller and an Ethernet port. This setup
+typically supports basic PoE functionality without the need for dynamic
+configuration or management of multiple power delivery modes. For such simple
+configurations, detailing the PSE PI within the existing PSE controller's node
+may suffice, as the system does not encompass additional complexity that
+warrants a separate node. The primary focus here is on the clear and direct
+association of power delivery to a specific Ethernet port.
+
+**Complex PSE PI Configuration:**
+In contrast, a complex PSE PI setup may encompass multiple PSE controllers or
+auxiliary circuits that collectively manage power delivery to one Ethernet
+port. Such configurations might support a range of PoE standards and require
+the capability to dynamically configure power delivery based on the operational
+mode (e.g., PoE2 versus PoE4) or specific requirements of connected devices. In
+these instances, a dedicated PSE PI node becomes essential for accurately
+documenting the system architecture. This node would serve to detail the
+interactions between different PSE controllers, the support for various PoE
+modes, and any additional logic required to coordinate power delivery across
+the network infrastructure.
+
+**Guidance:**
+
+For simple PSE setups, including PSE PI information in the PSE controller node
+might suffice due to the straightforward nature of these systems. However,
+complex configurations, involving multiple components or advanced PoE features,
+benefit from a dedicated PSE PI node. This method adheres to IEEE 802.3
+specifications, improving documentation clarity and ensuring accurate
+representation of the PoE system's complexity.
+
+PSE PI Node: Essential Information
+----------------------------------
+
+The PSE PI (Power Sourcing Equipment Power Interface) node in a device tree can
+include several key pieces of information critical for defining the power
+delivery capabilities and configurations of a PoE (Power over Ethernet) system.
+Below is a list of such information, along with explanations for their
+necessity and reasons why they might not be found within a PSE controller node:
+
+1. **Powered Pairs Configuration**
+
+ - *Description:* Identifies the pairs used for power delivery in the
+ Ethernet cable.
+ - *Necessity:* Essential to ensure the correct pairs are powered according
+ to the board's design.
+ - *PSE Controller Node:* Typically lacks details on physical pair usage,
+ focusing on power regulation.
+
+2. **Polarity of Powered Pairs**
+
+ - *Description:* Specifies the polarity (positive or negative) for each
+ powered pair.
+ - *Necessity:* Critical for safe and effective power transmission to PDs.
+ - *PSE Controller Node:* Polarity management may exceed the standard
+ functionalities of PSE controllers.
+
+3. **PSE Cells Association**
+
+ - *Description:* Details the association of PSE cells with Ethernet ports or
+ pairs in multi-cell configurations.
+ - *Necessity:* Allows for optimized power resource allocation in complex
+ systems.
+ - *PSE Controller Node:* Controllers may not manage cell associations
+ directly, focusing instead on power flow regulation.
+
+4. **Support for PoE Standards**
+
+ - *Description:* Lists the PoE standards and configurations supported by the
+ system.
+ - *Necessity:* Ensures system compatibility with various PDs and adherence
+ to industry standards.
+ - *PSE Controller Node:* Specific capabilities may depend on the overall PSE
+ PI design rather than the controller alone. Multiple PSE cells per PI
+ do not necessarily imply support for multiple PoE standards.
+
+5. **Protection Mechanisms**
+
+ - *Description:* Outlines additional protection mechanisms, such as
+ overcurrent protection and thermal management.
+ - *Necessity:* Provides extra safety and stability, complementing PSE
+ controller protections.
+ - *PSE Controller Node:* Some protections may be implemented via
+ board-specific hardware or algorithms external to the controller.
diff --git a/Documentation/networking/sriov.rst b/Documentation/networking/sriov.rst
new file mode 100644
index 000000000000..5deb4ff3154f
--- /dev/null
+++ b/Documentation/networking/sriov.rst
@@ -0,0 +1,25 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+NIC SR-IOV APIs
+===============
+
+Modern NICs are strongly encouraged to focus on implementing the ``switchdev``
+model (see :ref:`switchdev`) to configure forwarding and security of SR-IOV
+functionality.
+
+Legacy API
+==========
+
+The old SR-IOV API is implemented in the ``rtnetlink`` Netlink family as part of
+the ``RTM_GETLINK`` and ``RTM_SETLINK`` commands. On the driver side
+it consists of a number of ``ndo_set_vf_*`` and ``ndo_get_vf_*`` callbacks.
+
+Since the legacy APIs do not integrate well with the rest of the stack,
+the API is considered frozen; no new functionality or extensions
+will be accepted. New drivers should not implement the uncommon callbacks;
+namely the following callbacks are off limits:
+
+ - ``ndo_get_vf_port``
+ - ``ndo_set_vf_port``
+ - ``ndo_set_vf_rss_query_en``
diff --git a/Documentation/networking/switchdev.rst b/Documentation/networking/switchdev.rst
index 758f1dae3fce..f355f0166f1b 100644
--- a/Documentation/networking/switchdev.rst
+++ b/Documentation/networking/switchdev.rst
@@ -137,10 +137,10 @@ would be sub-port 0 on port 1 on switch 1.
Port Features
^^^^^^^^^^^^^
-NETIF_F_NETNS_LOCAL
+dev->netns_local
If the switchdev driver (and device) only supports offloading of the default
-network namespace (netns), the driver should set this feature flag to prevent
+network namespace (netns), the driver should set this private flag to prevent
the port netdev from being moved out of the default netns. A netns-aware
driver/device would not set this flag and be responsible for partitioning
hardware to preserve netns containment. This means hardware cannot forward
diff --git a/Documentation/networking/tcp_ao.rst b/Documentation/networking/tcp_ao.rst
index 8a58321acce7..d5b6d0df63c3 100644
--- a/Documentation/networking/tcp_ao.rst
+++ b/Documentation/networking/tcp_ao.rst
@@ -9,7 +9,7 @@ segments between trusted peers. It adds a new TCP header option with
a Message Authentication Code (MAC). MACs are produced from the content
of a TCP segment using a hashing function with a password known to both peers.
The intent of TCP-AO is to deprecate TCP-MD5 providing better security,
-key rotation and support for variety of hashing algorithms.
+key rotation and support for a variety of hashing algorithms.
1. Introduction
===============
@@ -164,9 +164,9 @@ A: It should not, no action needs to be performed [7.5.2.e]::
is not available, no action is required (RNextKeyID of a received
segment needs to match the MKT’s SendID).
-Q: How current_key is set and when does it change? It is a user-triggered
-change, or is it by a request from the remote peer? Is it set by the user
-explicitly, or by a matching rule?
+Q: How is current_key set, and when does it change? Is it a user-triggered
+change, or is it triggered by a request from the remote peer? Is it set by the
+user explicitly, or by a matching rule?
A: current_key is set by RNextKeyID [6.1]::
@@ -233,8 +233,8 @@ always have one current_key [3.3]::
Q: Can a non-TCP-AO connection become a TCP-AO-enabled one?
-A: No: for already established non-TCP-AO connection it would be impossible
-to switch using TCP-AO as the traffic key generation requires the initial
+A: No: for an already established non-TCP-AO connection it would be impossible
+to switch to using TCP-AO, as the traffic key generation requires the initial
sequence numbers. Paraphrasing, starting using TCP-AO would require
re-establishing the TCP connection.
@@ -292,7 +292,7 @@ no transparency is really needed and modern BGP daemons already have
Linux provides a set of ``setsockopt()s`` and ``getsockopt()s`` that let
userspace manage TCP-AO on a per-socket basis. In order to add/delete MKTs
-``TCP_AO_ADD_KEY`` and ``TCP_AO_DEL_KEY`` TCP socket options must be used
+``TCP_AO_ADD_KEY`` and ``TCP_AO_DEL_KEY`` TCP socket options must be used.
It is not allowed to add a key on an established non-TCP-AO connection
as well as to remove the last key from TCP-AO connection.
@@ -337,6 +337,15 @@ TCP-AO per-socket counters are also duplicated with per-netns counters,
exposed with SNMP. Those are ``TCPAOGood``, ``TCPAOBad``, ``TCPAOKeyNotFound``,
``TCPAORequired`` and ``TCPAODroppedIcmps``.
+For monitoring purposes, there are the following TCP-AO trace events:
+``tcp_hash_bad_header``, ``tcp_hash_ao_required``, ``tcp_ao_handshake_failure``,
+``tcp_ao_wrong_maclen``, ``tcp_ao_key_not_found``, ``tcp_ao_rnext_request``,
+``tcp_ao_synack_no_key``, ``tcp_ao_snd_sne_update``,
+``tcp_ao_rcv_sne_update``. It's possible to separately enable any of them and
+one can filter them by net-namespace, 4-tuple, family, L3 index, and TCP header
+flags. If a segment has a TCP-AO header, the filters may also include
+keyid, rnext, and maclen. SNE updates include the rolled-over numbers.
+
RFC 5925 very permissively specifies how TCP port matching can be done for
MKTs::
@@ -352,7 +361,7 @@ not implemented.
4. ``setsockopt()`` vs ``accept()`` race
========================================
-In contrast with TCP-MD5 established connection which has just one key,
+In contrast with an established TCP-MD5 connection which has just one key,
TCP-AO connections may have many keys, which means that accepted connections
on a listen socket may have any amount of keys as well. As copying all those
keys on a first properly signed SYN would make the request socket bigger, that
@@ -365,7 +374,7 @@ keys from sockets that were already established, but not yet ``accept()``'ed,
hanging in the accept queue.
The reverse is valid as well: if userspace adds a new key for a peer on
-a listener socket, the established sockets in accept queue won't
+a listener socket, the established sockets in the accept queue won't
have the new keys.
At this moment, the resolution for the two races:
@@ -373,7 +382,7 @@ At this moment, the resolution for the two races:
and ``setsockopt(TCP_AO_DEL_KEY)`` vs ``accept()`` is delegated to userspace.
This means that it's expected that userspace would check the MKTs on the socket
that was returned by ``accept()`` to verify that any key rotation that
-happened on listen socket is reflected on the newly established connection.
+happened on the listen socket is reflected on the newly established connection.
This is a similar "do-nothing" approach to TCP-MD5 from the kernel side and
may be changed later by introducing new flags to ``tcp_ao_add``
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index 5e93cd71f99f..61ef9da10e28 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -158,7 +158,8 @@ SOF_TIMESTAMPING_SYS_HARDWARE:
SOF_TIMESTAMPING_RAW_HARDWARE:
Report hardware timestamps as generated by
- SOF_TIMESTAMPING_TX_HARDWARE when available.
+ SOF_TIMESTAMPING_TX_HARDWARE or SOF_TIMESTAMPING_RX_HARDWARE
+ when available.
1.3.3 Timestamp Options
@@ -193,6 +194,20 @@ SOF_TIMESTAMPING_OPT_ID:
among all possibly concurrently outstanding timestamp requests for
that socket.
+ The process can optionally override the default generated ID, by
+ passing a specific ID with control message SCM_TS_OPT_ID (not
+ supported for TCP sockets)::
+
+ struct msghdr *msg;
+ ...
+ cmsg = CMSG_FIRSTHDR(msg);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_TS_OPT_ID;
+ cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
+ *((__u32 *) CMSG_DATA(cmsg)) = opt_id;
+ err = sendmsg(fd, msg, 0);
+
+
SOF_TIMESTAMPING_OPT_ID_TCP:
Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
@@ -266,6 +281,23 @@ SOF_TIMESTAMPING_OPT_TX_SWHW:
two separate messages will be looped to the socket's error queue,
each containing just one timestamp.
+SOF_TIMESTAMPING_OPT_RX_FILTER:
+ Filter out spurious receive timestamps: report a receive timestamp
+ only if the matching timestamp generation flag is enabled.
+
+ Receive timestamps are generated early in the ingress path, before a
+ packet's destination socket is known. If any socket enables receive
+ timestamps, packets for all sockets will be timestamped,
+ including for sockets that request timestamp reporting with
+ SOF_TIMESTAMPING_SOFTWARE and/or SOF_TIMESTAMPING_RAW_HARDWARE, but
+ do not request receive timestamp generation. This can happen when
+ requesting transmit timestamps only.
+
+ Receiving spurious timestamps is generally benign. A process can
+ ignore the unexpected non-zero value. But it makes behavior subtly
+ dependent on other sockets. This flag isolates the socket for more
+ deterministic behavior.
+
New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
regardless of the setting of sysctl net.core.tstamp_allow_data.
@@ -493,8 +525,8 @@ implicitly defined. ts[0] holds a software timestamp if set, ts[1]
is again deprecated and ts[2] holds a hardware timestamp if set.
-3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
-=======================================================================
+3. Hardware Timestamping configuration: ETHTOOL_MSG_TSCONFIG_SET/GET
+====================================================================
Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
@@ -507,12 +539,14 @@ include/uapi/linux/net_tstamp.h as::
};
Desired behavior is passed into the kernel and to a specific device by
-calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
-ifr_data points to a struct hwtstamp_config. The tx_type and
-rx_filter are hints to the driver what it is expected to do. If
-the requested fine-grained filtering for incoming packets is not
-supported, the driver may time stamp more than just the requested types
-of packets.
+sending an ``ETHTOOL_MSG_TSCONFIG_SET`` message over the ethtool netlink socket.
+The ``ETHTOOL_A_TSCONFIG_TX_TYPES``, ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` and
+``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` netlink attributes are then used to set
+the struct hwtstamp_config accordingly.
+
+The ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` netlink nested attribute is used
+to select the source of the hardware time stamping. It is composed of an index
+for the device source and a qualifier for the type of time stamping.
Drivers are free to use a more permissive configuration than the requested
configuration. It is expected that drivers should only implement directly the
@@ -531,9 +565,16 @@ Only a processes with admin rights may change the configuration. User
space is responsible to ensure that multiple processes don't interfere
with each other and that the settings are reset.
-Any process can read the actual configuration by passing this
-structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
-not been implemented in all drivers.
+Any process can read the actual configuration by sending an
+``ETHTOOL_MSG_TSCONFIG_GET`` message over the ethtool netlink socket.
+
+The legacy configuration method is to call ioctl(SIOCSHWTSTAMP) with a pointer
+to a struct ifreq whose ifr_data points to a struct hwtstamp_config.
+The tx_type and rx_filter are hints to the driver about what it is expected
+to do. If the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets. ioctl(SIOCGHWTSTAMP) is used in the same way as
+ioctl(SIOCSHWTSTAMP). However, this has not been implemented in all drivers.
::
@@ -578,9 +619,10 @@ not been implemented in all drivers.
--------------------------------------------------------
A driver which supports hardware time stamping must support the
-SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
-the actual values as described in the section on SIOCSHWTSTAMP. It
-should also support SIOCGHWTSTAMP.
+ndo_hwtstamp_set NDO or the legacy SIOCSHWTSTAMP ioctl and update the
+supplied struct hwtstamp_config with the actual values as described in
+the section on SIOCSHWTSTAMP. It should also support ndo_hwtstamp_get or
+the legacy SIOCGHWTSTAMP.
Time stamps for received packets must be stored in the skb. To get a pointer
to the shared time stamp structure of the skb call skb_hwtstamps(). Then
diff --git a/Documentation/networking/tipc.rst b/Documentation/networking/tipc.rst
index ab63d298cca2..9b375b9b9981 100644
--- a/Documentation/networking/tipc.rst
+++ b/Documentation/networking/tipc.rst
@@ -112,7 +112,7 @@ More Information
- How to contribute to TIPC:
-- http://tipc.io/contacts.html
+ http://tipc.io/contacts.html
- More details about TIPC specification:
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index 5f0dea3d571e..7354d48cdf92 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -51,7 +51,7 @@ and send them to the device for encryption and transmission.
RX
--
-On the receive side if the device handled decryption and authentication
+On the receive side, if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
@@ -120,8 +120,9 @@ before installing the connection state in the kernel.
RX
--
-In RX direction local networking stack has little control over the segmentation,
-so the initial records' TCP sequence number may be anywhere inside the segment.
+In the RX direction, the local networking stack has little control over
+segmentation, so the initial records' TCP sequence number may be anywhere
+inside the segment.
Normal operation
================
@@ -138,8 +139,8 @@ There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
-to be possible device has to keep small amount of segment-to-segment state.
-This includes at least:
+to be possible, the device has to keep a small amount of segment-to-segment
+state. This includes at least:
* partial headers (if a segment carried only a part of the TLS header)
* partial data block
@@ -175,12 +176,12 @@ and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
-is always TCP). If connection is matched device confirms if the TCP sequence
-number is the expected one and proceeds to TLS handling (record delineation,
-decryption, authentication for each record in the packet). The device leaves
-the record framing unmodified, the stack takes care of record decapsulation.
-Device indicates successful handling of TLS offload in the per-packet context
-(descriptor) passed to the host.
+is always TCP). If the packet is matched to a connection, the device confirms
+if the TCP sequence number is the expected one and proceeds to TLS handling
+(record delineation, decryption, authentication for each record in the packet).
+The device leaves the record framing unmodified; the stack takes care of record
+decapsulation. The device indicates successful handling of TLS offload in the
+per-packet context (descriptor) passed to the host.
Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
@@ -439,7 +440,7 @@ by the driver:
* ``rx_tls_resync_req_end`` - number of times the TLS async resync request
properly ended with providing the HW tracked tcp-seq.
* ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
- procedure was started by not properly ended.
+ procedure was started but not properly ended.
* ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
the driver was successfully handled.
* ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
@@ -507,8 +508,8 @@ in packets as seen on the wire.
Transport layer transparency
----------------------------
-The device should not modify any packet headers for the purpose
-of the simplifying TLS offload.
+For the purpose of simplifying TLS offload, the device should not modify any
+packet headers.
The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.
diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst
index 658ed3a71e1b..c7904a1bc167 100644
--- a/Documentation/networking/tls.rst
+++ b/Documentation/networking/tls.rst
@@ -200,6 +200,32 @@ received without a cmsg buffer set.
recv will never return data from mixed types of TLS records.
+TLS 1.3 Key Updates
+-------------------
+
+In TLS 1.3, KeyUpdate handshake messages signal that the sender is
+updating its TX key. Any message sent after a KeyUpdate will be
+encrypted using the new key. The userspace library can pass the new
+key to the kernel using the TLS_TX and TLS_RX socket options, as for
+the initial keys. TLS version and cipher cannot be changed.
+
+To prevent attempting to decrypt incoming records using the wrong key,
+decryption will be paused when a KeyUpdate message is received by the
+kernel, until the new key has been provided using the TLS_RX socket
+option. Any read occurring after the KeyUpdate has been read and
+before the new key is provided will fail with EKEYEXPIRED. poll() will
+not report any read events from the socket until the new key is
+provided. There is no pausing on the transmit side.
+
+Userspace should make sure that the crypto_info provided has been set
+properly. In particular, the kernel will not check for key/nonce
+reuse.
+
+The number of successful and failed key updates is tracked in the
+``TlsTxRekeyOk``, ``TlsRxRekeyOk``, ``TlsTxRekeyError``,
+``TlsRxRekeyError`` statistics. The ``TlsRxRekeyReceived`` statistic
+counts KeyUpdate handshake messages that have been received.
+
Integrating in to userspace TLS library
---------------------------------------
@@ -286,3 +312,13 @@ TLS implementation exposes the following per-namespace statistics
- ``TlsRxNoPadViolation`` -
number of data RX records which had to be re-decrypted due to
``TLS_RX_EXPECT_NO_PAD`` mis-prediction.
+
+- ``TlsTxRekeyOk``, ``TlsRxRekeyOk`` -
+ number of successful rekeys on existing sessions for TX and RX
+
+- ``TlsTxRekeyError``, ``TlsRxRekeyError`` -
+ number of failed rekeys on existing sessions for TX and RX
+
+- ``TlsRxRekeyReceived`` -
+ number of received KeyUpdate handshake messages, requiring userspace
+ to provide a new RX key
diff --git a/Documentation/networking/tproxy.rst b/Documentation/networking/tproxy.rst
index 00dc3a1a66b4..7f7c1ff6f159 100644
--- a/Documentation/networking/tproxy.rst
+++ b/Documentation/networking/tproxy.rst
@@ -17,7 +17,7 @@ The idea is that you identify packets with destination address matching a local
socket on your box, set the packet mark to a certain value::
# iptables -t mangle -N DIVERT
- # iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
+ # iptables -t mangle -A PREROUTING -p tcp -m socket --transparent -j DIVERT
# iptables -t mangle -A DIVERT -j MARK --set-mark 1
# iptables -t mangle -A DIVERT -j ACCEPT
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
index bfea9d8579ed..66f6e9a9b59a 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm_device.rst
@@ -169,7 +169,8 @@ the stack in xfrm_input().
hand the packet to napi_gro_receive() as usual
-In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn().
+In ESN mode, xdo_dev_state_advance_esn() is called from
+xfrm_replay_advance_esn() for RX, and xfrm_replay_overflow_offload_esn() for TX.
Driver will check packet seq number and update HW ESN state machine if needed.
Packet offload mode:
diff --git a/Documentation/networking/xfrm_proc.rst b/Documentation/networking/xfrm_proc.rst
index 0a771c5a7399..973d1571acac 100644
--- a/Documentation/networking/xfrm_proc.rst
+++ b/Documentation/networking/xfrm_proc.rst
@@ -73,6 +73,9 @@ XfrmAcquireError:
XfrmFwdHdrError:
Forward routing of a packet is not allowed
+XfrmInStateDirError:
+ State direction mismatch (lookup found an output state on the input path, expected input or no direction)
+
Outbound errors
~~~~~~~~~~~~~~~
XfrmOutError:
@@ -111,3 +114,6 @@ XfrmOutPolError:
XfrmOutStateInvalid:
State is invalid, perhaps expired
+
+XfrmOutStateDirError:
+ State direction mismatch (lookup found an input state on the output path, expected output or no direction)
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
index bd033fe95cca..e76b0cfc32f7 100644
--- a/Documentation/networking/xsk-tx-metadata.rst
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -11,12 +11,16 @@ metadata on the receive side.
General Design
==============
-The headroom for the metadata is reserved via ``tx_metadata_len`` in
-``struct xdp_umem_reg``. The metadata length is therefore the same for
-every socket that shares the same umem. The metadata layout is a fixed UAPI,
-refer to ``union xsk_tx_metadata`` in ``include/uapi/linux/if_xdp.h``.
-Thus, generally, the ``tx_metadata_len`` field above should contain
-``sizeof(union xsk_tx_metadata)``.
+The headroom for the metadata is reserved via ``tx_metadata_len`` and the
+``XDP_UMEM_TX_METADATA_LEN`` flag in ``struct xdp_umem_reg``. The metadata
+length is therefore the same for every socket that shares the same umem.
+The metadata layout is a fixed UAPI; refer to ``union xsk_tx_metadata`` in
+``include/uapi/linux/if_xdp.h``. Thus, generally, the ``tx_metadata_len``
+field above should contain ``sizeof(union xsk_tx_metadata)``.
+
+Note that in the original implementation the ``XDP_UMEM_TX_METADATA_LEN``
+flag was not required. Applications might attempt to create a umem
+with the flag first and, if that fails, try again without the flag.
The headroom and the metadata itself should be located right before
``xdp_desc->addr`` in the umem frame. Within a frame, the metadata