Diffstat (limited to 'Documentation/networking')
20 files changed, 349 insertions, 53 deletions
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst index 8a0dcb1894b4..44b9b5cc0e24 100644 --- a/Documentation/networking/batman-adv.rst +++ b/Documentation/networking/batman-adv.rst @@ -164,5 +164,5 @@ Mailing-list: You can also contact the Authors: -* Marek Lindner <mareklindner@neomailbox.ch> +* Marek Lindner <marek.lindner@mailbox.org> * Simon Wunderlich <sw@simonwunderlich.de> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index 7c8d22d68682..a4c1291d2561 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -1963,7 +1963,7 @@ obtain its hardware address from the first slave, which might not match the hardware address of the VLAN interfaces (which was ultimately copied from an earlier slave). -There are two methods to insure that the VLAN device operates +There are two methods to ensure that the VLAN device operates with the correct hardware address if all slaves are removed from a bond interface: @@ -2078,7 +2078,7 @@ as an unsolicited ARP reply (because ARP matches replies on an interface basis), and is discarded. The MII monitor is not affected by the state of the routing table. -The solution here is simply to insure that slaves do not have +The solution here is simply to ensure that slaves do not have routes of their own, and if for some reason they must, those routes do not supersede routes of their master. This should generally be the case, but unusual configurations or errant manual or automatic static @@ -2295,7 +2295,7 @@ active-backup: the switches have an ISL and play together well. If the network configuration is such that one switch is specifically a backup switch (e.g., has lower capacity, higher cost, etc), - then the primary option can be used to insure that the + then the primary option can be used to ensure that the preferred link is always used when it is available. 
broadcast: @@ -2322,7 +2322,7 @@ monitor can provide a higher level of reliability in detecting end to end connectivity failures (which may be caused by the failure of any individual component to pass traffic for any reason). Additionally, the ARP monitor should be configured with multiple targets (at least -one for each switch in the network). This will insure that, +one for each switch in the network). This will ensure that, regardless of which switch is active, the ARP monitor has a suitable target to query. diff --git a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst index 4fbaa1a2d674..53d9d5829d69 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst @@ -299,6 +299,18 @@ Use ethtool to view and set link-down-on-close, as follows:: ethtool --show-priv-flags ethX ethtool --set-priv-flags ethX link-down-on-close [on|off] +Setting the mdd-auto-reset-vf Private Flag +------------------------------------------ + +When the mdd-auto-reset-vf private flag is set to "on", the problematic VF will +be automatically reset if a malformed descriptor is detected. If the flag is +set to "off", the problematic VF will be disabled. + +Use ethtool to view and set mdd-auto-reset-vf, as follows:: + + ethtool --show-priv-flags ethX + ethtool --set-priv-flags ethX mdd-auto-reset-vf [on|off] + Viewing Link Messages --------------------- Link messages will not be displayed to the console if the distribution is diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 456985407475..41618538fc70 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -53,6 +53,9 @@ parameters. * ``smfs`` Software managed flow steering. 
In SMFS mode, the HW steering entities are created and manage through the driver without firmware intervention. + * ``hmfs`` Hardware managed flow steering. In HMFS mode, the driver + is configuring steering rules directly to the HW using Work Queues with + a special new type of WQE (Work Queue Element). SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode. diff --git a/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst b/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst index c9be5cc7e113..079e17effadf 100644 --- a/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst +++ b/Documentation/networking/diagnostic/twisted_pair_layer1_diagnostics.rst @@ -713,17 +713,23 @@ driver supports reporting such events. - **Monitor Error Counters**: - - While some NIC drivers and PHYs provide error counters, there is no unified - set of PHY-specific counters across all hardware. Additionally, not all - PHYs provide useful information related to errors like CRC errors, frame - drops, or link flaps. Therefore, this step is dependent on the specific - hardware and driver support. - - - **Next Steps**: Use `ethtool -S <interface>` to check if your driver - provides useful error counters. In some cases, counters may provide - information about errors like link flaps or physical layer problems (e.g., - excessive CRC errors), but results can vary significantly depending on the - PHY. + - Use `ethtool -S <interface> --all-groups` to retrieve standardized interface + statistics if the driver supports the unified interface: + + - **Command:** `ethtool -S <interface> --all-groups` + + - **Example Output (if supported)**: + + .. code-block:: bash + + phydev-RxFrames: 100391 + phydev-RxErrors: 0 + phydev-TxFrames: 9 + phydev-TxErrors: 0 + + - If the unified interface is not supported, use `ethtool -S <interface>` to + retrieve MAC and PHY counters. 
Note that non-standardized PHY counter names + vary by driver and must be interpreted accordingly: - **Command:** `ethtool -S <interface>` @@ -740,6 +746,17 @@ driver supports reporting such events. condition) or kernel log messages (e.g., link up/down events) to further diagnose the issue. + - **Compare Counters**: + + - Compare the egress and ingress frame counts reported by the PHY and MAC. + + - A small difference may occur due to sampling rate differences between the + MAC and PHY drivers, or if the PHY and MAC are not always fully + synchronized in their UP or DOWN states. + + - Significant discrepancies indicate potential issues in the data path + between the MAC and PHY. + When All Else Fails... ~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index b25926071ece..3770a2294509 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -237,6 +237,8 @@ Userspace to kernel: ``ETHTOOL_MSG_MM_SET`` set MAC merge layer parameters ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT`` flash transceiver module firmware ``ETHTOOL_MSG_PHY_GET`` get Ethernet PHY information + ``ETHTOOL_MSG_TSCONFIG_GET`` get hw timestamping configuration + ``ETHTOOL_MSG_TSCONFIG_SET`` set hw timestamping configuration ===================================== ================================= Kernel to userspace: @@ -286,6 +288,8 @@ Kernel to userspace: ``ETHTOOL_MSG_MODULE_FW_FLASH_NTF`` transceiver module flash updates ``ETHTOOL_MSG_PHY_GET_REPLY`` Ethernet PHY information ``ETHTOOL_MSG_PHY_NTF`` Ethernet PHY information change + ``ETHTOOL_MSG_TSCONFIG_GET_REPLY`` hw timestamping configuration + ``ETHTOOL_MSG_TSCONFIG_SET_REPLY`` new hw timestamping configuration ======================================== ================================= ``GET`` requests are sent by userspace applications to retrieve device @@ -895,6 +899,10 @@ Kernel response contents: 
``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX`` u32 max size of TX push buffer + ``ETHTOOL_A_RINGS_HDS_THRESH`` u32 threshold of + header / data split + ``ETHTOOL_A_RINGS_HDS_THRESH_MAX`` u32 max threshold of + header / data split ======================================= ====== =========================== ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with @@ -937,10 +945,12 @@ Request contents: ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring + ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer + ``ETHTOOL_A_RINGS_HDS_THRESH`` u32 threshold of header / data split ==================================== ====== =========================== Kernel checks that requested ring sizes do not exceed limits reported by @@ -957,6 +967,10 @@ A bigger CQE can have more receive buffer pointers, and in turn the NIC can transfer a bigger frame from wire. Based on the NIC hardware, the overall completion queue size can be adjusted in the driver if CQE size is modified. +``ETHTOOL_A_RINGS_HDS_THRESH`` specifies the threshold value of +header / data split feature. If a received packet size is larger than this +threshold value, header and data will be split. + CHANNELS_GET ============ @@ -1245,9 +1259,10 @@ Gets timestamping information like ``ETHTOOL_GET_TS_INFO`` ioctl request. 
Request contents: - ===================================== ====== ========================== - ``ETHTOOL_A_TSINFO_HEADER`` nested request header - ===================================== ====== ========================== + ======================================== ====== ============================ + ``ETHTOOL_A_TSINFO_HEADER`` nested request header + ``ETHTOOL_A_TSINFO_HWTSTAMP_PROVIDER`` nested PTP hw clock provider + ======================================== ====== ============================ Kernel response contents: @@ -1266,11 +1281,17 @@ would be empty (no bit set). Additional hardware timestamping statistics response contents: - ===================================== ====== =================================== - ``ETHTOOL_A_TS_STAT_TX_PKTS`` uint Packets with Tx HW timestamps - ``ETHTOOL_A_TS_STAT_TX_LOST`` uint Tx HW timestamp not arrived count - ``ETHTOOL_A_TS_STAT_TX_ERR`` uint HW error request Tx timestamp count - ===================================== ====== =================================== + ================================================== ====== ===================== + ``ETHTOOL_A_TS_STAT_TX_PKTS`` uint Packets with Tx + HW timestamps + ``ETHTOOL_A_TS_STAT_TX_LOST`` uint Tx HW timestamp + not arrived count + ``ETHTOOL_A_TS_STAT_TX_ERR`` uint HW error request + Tx timestamp count + ``ETHTOOL_A_TS_STAT_TX_ONESTEP_PKTS_UNCONFIRMED`` uint Packets with one-step + HW TX timestamps with + unconfirmed delivery + ================================================== ====== ===================== CABLE_TEST ========== @@ -1611,6 +1632,7 @@ the ``ETHTOOL_A_STATS_GROUPS`` bitset. 
Currently defined values are: ETHTOOL_STATS_ETH_PHY eth-phy Basic IEEE 802.3 PHY statistics (30.3.2.1.*) ETHTOOL_STATS_ETH_CTRL eth-ctrl Basic IEEE 802.3 MAC Ctrl statistics (30.3.3.*) ETHTOOL_STATS_RMON rmon RMON (RFC 2819) statistics + ETHTOOL_STATS_PHY phy Additional PHY statistics, not defined by IEEE ====================== ======== =============================================== Each group should have a corresponding ``ETHTOOL_A_STATS_GRP`` in the reply. @@ -2243,6 +2265,75 @@ Kernel response contents: When ``ETHTOOL_A_PHY_UPSTREAM_TYPE`` is PHY_UPSTREAM_PHY, the PHY's parent is another PHY. +TSCONFIG_GET +============ + +Retrieves the information about the current hardware timestamping source and +configuration. + +It is similar to the deprecated ``SIOCGHWTSTAMP`` ioctl request. + +Request contents: + + ==================================== ====== ========================== + ``ETHTOOL_A_TSCONFIG_HEADER`` nested request header + ==================================== ====== ========================== + +Kernel response contents: + + ======================================== ====== ============================ + ``ETHTOOL_A_TSCONFIG_HEADER`` nested request header + ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested PTP hw clock provider + ``ETHTOOL_A_TSCONFIG_TX_TYPES`` bitset hwtstamp Tx type + ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` bitset hwtstamp Rx filter + ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` u32 hwtstamp flags + ======================================== ====== ============================ + +When set the ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` attribute identifies the +source of the hw timestamping provider. It is composed by +``ETHTOOL_A_TS_HWTSTAMP_PROVIDER_INDEX`` attribute which describe the index of +the PTP device and ``ETHTOOL_A_TS_HWTSTAMP_PROVIDER_QUALIFIER`` which describe +the qualifier of the timestamp. 
+ +When set the ``ETHTOOL_A_TSCONFIG_TX_TYPES``, ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` +and the ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` attributes identify the Tx +type, the Rx filter and the flags configured for the current hw timestamping +provider. The attributes are propagated to the driver through the following +structure: + +.. kernel-doc:: include/linux/net_tstamp.h + :identifiers: kernel_hwtstamp_config + +TSCONFIG_SET +============ + +Set the information about the current hardware timestamping source and +configuration. + +It is similar to the deprecated ``SIOCSHWTSTAMP`` ioctl request. + +Request contents: + + ======================================== ====== ============================ + ``ETHTOOL_A_TSCONFIG_HEADER`` nested request header + ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested PTP hw clock provider + ``ETHTOOL_A_TSCONFIG_TX_TYPES`` bitset hwtstamp Tx type + ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` bitset hwtstamp Rx filter + ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` u32 hwtstamp flags + ======================================== ====== ============================ + +Kernel response contents: + + ======================================== ====== ============================ + ``ETHTOOL_A_TSCONFIG_HEADER`` nested request header + ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` nested PTP hw clock provider + ``ETHTOOL_A_TSCONFIG_TX_TYPES`` bitset hwtstamp Tx type + ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` bitset hwtstamp Rx filter + ``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` u32 hwtstamp flags + ======================================== ====== ============================ + +For a description of each attribute, see ``TSCONFIG_GET``. + Request translation =================== @@ -2351,4 +2442,6 @@ are netlink only. 
n/a ``ETHTOOL_MSG_MM_SET`` n/a ``ETHTOOL_MSG_MODULE_FW_FLASH_ACT`` n/a ``ETHTOOL_MSG_PHY_GET`` + ``SIOCGHWTSTAMP`` ``ETHTOOL_MSG_TSCONFIG_GET`` + ``SIOCSHWTSTAMP`` ``ETHTOOL_MSG_TSCONFIG_SET`` =================================== ===================================== diff --git a/Documentation/networking/ieee802154.rst b/Documentation/networking/ieee802154.rst index c652d383fe10..743c0a80e309 100644 --- a/Documentation/networking/ieee802154.rst +++ b/Documentation/networking/ieee802154.rst @@ -72,7 +72,8 @@ exports a management (e.g. MLME) and data API. possibly with some kinds of acceleration like automatic CRC computation and comparison, automagic ACK handling, address matching, etc. -Those types of devices require different approach to be hooked into Linux kernel. +Each type of device requires a different approach to be hooked into the Linux +kernel. HardMAC ------- @@ -81,10 +82,10 @@ See the header include/net/ieee802154_netdev.h. You have to implement Linux net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family code via plain sk_buffs. On skb reception skb->cb must contain additional info as described in the struct ieee802154_mac_cb. During packet transmission -the skb->cb is used to provide additional data to device's header_ops->create -function. Be aware that this data can be overridden later (when socket code -submits skb to qdisc), so if you need something from that cb later, you should -store info in the skb->data on your own. +the skb->cb is used to provide additional data to the device's +header_ops->create function. Be aware that this data can be overridden later +(when socket code submits skb to qdisc), so if you need something from that cb +later, you should store info in the skb->data on your own. To hook the MLME interface you have to populate the ml_priv field of your net_device with a pointer to struct ieee802154_mlme_ops instance. The fields @@ -94,8 +95,9 @@ All other fields are required. 
SoftMAC ------- -The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it -provides interface for drivers registration and management of slave interfaces. +The MAC is the middle layer in the IEEE 802.15.4 Linux stack. At the moment, it +provides an interface for driver registration and management of slave +interfaces. NOTE: Currently the only monitor device type is supported - it's IEEE 802.15.4 stack interface for network sniffers (e.g. WireShark). diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 46c178e564b3..058193ed2eeb 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -86,6 +86,7 @@ Contents: netdevices netfilter-sysctl netif-msg + netmem nexthop-group-resilient nf_conntrack-sysctl nf_flowtable diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index dcbb6f6caf6d..363b4950d542 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -1000,6 +1000,20 @@ tcp_tw_reuse - INTEGER Default: 2 +tcp_tw_reuse_delay - UNSIGNED INTEGER + The delay in milliseconds before a TIME-WAIT socket can be reused by a + new connection, if TIME-WAIT socket reuse is enabled. The actual reuse + threshold is within [N, N+1] range, where N is the requested delay in + milliseconds, to ensure the delay interval is never shorter than the + configured value. + + This setting contains an assumption about the other TCP timestamp clock + tick interval. It should not be set to a value lower than the peer's + clock tick for PAWS (Protection Against Wrapped Sequence numbers) + mechanism work correctly for the reused connection. + + Default: 1000 (milliseconds) + tcp_window_scaling - BOOLEAN Enable window scaling as defined in RFC1323. 
diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst index 95598c21fc8e..dc45c0211353 100644 --- a/Documentation/networking/mptcp-sysctl.rst +++ b/Documentation/networking/mptcp-sysctl.rst @@ -108,3 +108,19 @@ stale_loss_cnt - INTEGER This is a per-namespace sysctl. Default: 4 + +syn_retrans_before_tcp_fallback - INTEGER + The number of SYN + MP_CAPABLE retransmissions before falling back to + TCP, i.e. dropping the MPTCP options. In other words, if all the packets + are dropped on the way, there will be: + + * The initial SYN with MPTCP support + * This number of SYN retransmitted with MPTCP support + * The next SYN retransmissions will be without MPTCP support + + 0 means the first retransmission will be done without MPTCP options. + >= 128 means that all SYN retransmissions will keep the MPTCP options. A + lower number might increase false-positive MPTCP blackholes detections. + This is a per-namespace sysctl. + + Default: 2 diff --git a/Documentation/networking/multi-pf-netdev.rst b/Documentation/networking/multi-pf-netdev.rst index 2cd25d81aaa7..2f5a5bb3ca9a 100644 --- a/Documentation/networking/multi-pf-netdev.rst +++ b/Documentation/networking/multi-pf-netdev.rst @@ -89,7 +89,7 @@ Observability ============= The relation between PF, irq, napi, and queue can be observed via netlink spec:: - $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}' + $ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}' [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'}, {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'}, {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'}, @@ -101,7 +101,7 @@ The relation between PF, irq, napi, and queue can be observed via netlink spec:: {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'}, {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}] - $ 
./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}' + $ ./tools/net/ynl/pyynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}' [{'id': 543, 'ifindex': 13, 'irq': 42}, {'id': 542, 'ifindex': 13, 'irq': 41}, {'id': 541, 'ifindex': 13, 'irq': 40}, diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst index 02720dd71a76..6083210ab2a4 100644 --- a/Documentation/networking/napi.rst +++ b/Documentation/networking/napi.rst @@ -199,13 +199,13 @@ parameters mentioned above use hyphens instead of underscores: Per-NAPI configuration can be done programmatically in a user application or by using a script included in the kernel source tree: -``tools/net/ynl/cli.py``. +``tools/net/ynl/pyynl/cli.py``. For example, using the script: .. code-block:: bash - $ kernel-source/tools/net/ynl/cli.py \ + $ kernel-source/tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/netdev.yaml \ --do napi-set \ --json='{"id": 345, diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst index 629da6dc6d74..de0263302f16 100644 --- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst +++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst @@ -79,6 +79,7 @@ u8 sysctl_tcp_retries1 u8 sysctl_tcp_retries2 u8 sysctl_tcp_orphan_retries u8 sysctl_tcp_tw_reuse timewait_sock_ops +unsigned_int sysctl_tcp_tw_reuse_delay timewait_sock_ops int sysctl_tcp_fin_timeout TCP_LAST_ACK/tcp_rcv_state_process unsigned_int sysctl_tcp_notsent_lowat read_mostly tcp_notsent_lowat/tcp_stream_memory_free u8 sysctl_tcp_sack tcp_syn_options diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst index d55c2a22ec7a..94c4680fdf3e 100644 --- a/Documentation/networking/netconsole.rst +++ b/Documentation/networking/netconsole.rst @@ -124,7 +124,7 @@ To 
remove a target:: The interface exposes these parameters of a netconsole target to userspace: - ============== ================================= ============ + =============== ================================= ============ enabled Is this target currently enabled? (read-write) extended Extended mode enabled (read-write) release Prepend kernel release to message (read-write) @@ -135,7 +135,8 @@ The interface exposes these parameters of a netconsole target to userspace: remote_ip Remote agent's IP address (read-write) local_mac Local interface's MAC address (read-only) remote_mac Remote agent's MAC address (read-write) - ============== ================================= ============ + transmit_errors Number of packet send errors (read-only) + =============== ================================= ============ The "enabled" attribute is also used to control whether the parameters of a target can be updated or not -- you can modify the parameters of only diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst index 857c9784f87e..1d37038e9fbe 100644 --- a/Documentation/networking/netdevices.rst +++ b/Documentation/networking/netdevices.rst @@ -297,3 +297,13 @@ napi->poll: Context: softirq will be called with interrupts disabled by netconsole. + +NETDEV_INTERNAL symbol namespace +================================ + +Symbols exported as NETDEV_INTERNAL can only be used in networking +core and drivers which exclusively flow via the main networking list and trees. +Note that the inverse is not true, most symbols outside of NETDEV_INTERNAL +are not expected to be used by random code outside netdev either. +Symbols may lack the designation because they predate the namespaces, +or simply due to an oversight. 
diff --git a/Documentation/networking/netlink_spec/readme.txt b/Documentation/networking/netlink_spec/readme.txt index 6763f99d216c..030b44aca4e6 100644 --- a/Documentation/networking/netlink_spec/readme.txt +++ b/Documentation/networking/netlink_spec/readme.txt @@ -1,4 +1,4 @@ SPDX-License-Identifier: GPL-2.0 This file is populated during the build of the documentation (htmldocs) by the -tools/net/ynl/ynl-gen-rst.py script. +tools/net/ynl/pyynl/ynl_gen_rst.py script. diff --git a/Documentation/networking/netmem.rst b/Documentation/networking/netmem.rst new file mode 100644 index 000000000000..7de21ddb5412 --- /dev/null +++ b/Documentation/networking/netmem.rst @@ -0,0 +1,79 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +Netmem Support for Network Drivers +================================== + +This document outlines the requirements for network drivers to support netmem, +an abstract memory type that enables features like device memory TCP. By +supporting netmem, drivers can work with various underlying memory types +with little to no modification. + +Benefits of Netmem : + +* Flexibility: Netmem can be backed by different memory types (e.g., struct + page, DMA-buf), allowing drivers to support various use cases such as device + memory TCP. +* Future-proof: Drivers with netmem support are ready for upcoming + features that rely on it. +* Simplified Development: Drivers interact with a consistent API, + regardless of the underlying memory implementation. + +Driver Requirements +=================== + +1. The driver must support page_pool. + +2. The driver must support the tcp-data-split ethtool option. + +3. The driver must use the page_pool netmem APIs for payload memory. The netmem + APIs currently 1-to-1 correspond with page APIs. 
Conversion to netmem should + be achievable by switching the page APIs to netmem APIs and tracking memory + via netmem_refs in the driver rather than struct page * : + + - page_pool_alloc -> page_pool_alloc_netmem + - page_pool_get_dma_addr -> page_pool_get_dma_addr_netmem + - page_pool_put_page -> page_pool_put_netmem + + Not all page APIs have netmem equivalents at the moment. If your driver + relies on a missing netmem API, feel free to add and propose to netdev@, or + reach out to the maintainers and/or almasrymina@google.com for help adding + the netmem API. + +4. The driver must use the following PP_FLAGS: + + - PP_FLAG_DMA_MAP: netmem is not dma-mappable by the driver. The driver + must delegate the dma mapping to the page_pool, which knows when + dma-mapping is (or is not) appropriate. + - PP_FLAG_DMA_SYNC_DEV: netmem dma addr is not necessarily dma-syncable + by the driver. The driver must delegate the dma syncing to the page_pool, + which knows when dma-syncing is (or is not) appropriate. + - PP_FLAG_ALLOW_UNREADABLE_NETMEM. The driver must specify this flag iff + tcp-data-split is enabled. + +5. The driver must not assume the netmem is readable and/or backed by pages. + The netmem returned by the page_pool may be unreadable, in which case + netmem_address() will return NULL. The driver must correctly handle + unreadable netmem, i.e. don't attempt to handle its contents when + netmem_address() is NULL. + + Ideally, drivers should not have to check the underlying netmem type via + helpers like netmem_is_net_iov() or convert the netmem to any of its + underlying types via netmem_to_page() or netmem_to_net_iov(). In most cases, + netmem or page_pool helpers that abstract this complexity are provided + (and more can be added). + +6. The driver must use page_pool_dma_sync_netmem_for_cpu() in lieu of + dma_sync_single_range_for_cpu(). 
For some memory providers, dma_syncing for + CPU will be done by the page_pool, for others (particularly dmabuf memory + provider), dma syncing for CPU is the responsibility of the userspace using + dmabuf APIs. The driver must delegate the entire dma-syncing operation to + the page_pool which will do it correctly. + +7. Avoid implementing driver-specific recycling on top of the page_pool. Drivers + cannot hold onto a struct page to do their own recycling as the netmem may + not be backed by a struct page. However, you may hold onto a page_pool + reference with page_pool_fragment_netmem() or page_pool_ref_netmem() for + that purpose, but be mindful that some netmem types might have longer + circulation times, such as when userspace holds a reference in zerocopy + scenarios. diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst index b37bfbfc7d79..61ef9da10e28 100644 --- a/Documentation/networking/timestamping.rst +++ b/Documentation/networking/timestamping.rst @@ -525,8 +525,8 @@ implicitly defined. ts[0] holds a software timestamp if set, ts[1] is again deprecated and ts[2] holds a hardware timestamp if set. -3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP -======================================================================= +3. Hardware Timestamping configuration: ETHTOOL_MSG_TSCONFIG_SET/GET +==================================================================== Hardware time stamping must also be initialized for each device driver that is expected to do hardware time stamping. The parameter is defined in @@ -539,12 +539,14 @@ include/uapi/linux/net_tstamp.h as:: }; Desired behavior is passed into the kernel and to a specific device by -calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose -ifr_data points to a struct hwtstamp_config. The tx_type and -rx_filter are hints to the driver what it is expected to do. 
If -the requested fine-grained filtering for incoming packets is not -supported, the driver may time stamp more than just the requested types -of packets. +calling the tsconfig netlink socket ``ETHTOOL_MSG_TSCONFIG_SET``. +The ``ETHTOOL_A_TSCONFIG_TX_TYPES``, ``ETHTOOL_A_TSCONFIG_RX_FILTERS`` and +``ETHTOOL_A_TSCONFIG_HWTSTAMP_FLAGS`` netlink attributes are then used to set +the struct hwtstamp_config accordingly. + +The ``ETHTOOL_A_TSCONFIG_HWTSTAMP_PROVIDER`` netlink nested attribute is used +to select the source of the hardware time stamping. It is composed of an index +for the device source and a qualifier for the type of time stamping. Drivers are free to use a more permissive configuration than the requested configuration. It is expected that drivers should only implement directly the @@ -563,9 +565,16 @@ Only a processes with admin rights may change the configuration. User space is responsible to ensure that multiple processes don't interfere with each other and that the settings are reset. -Any process can read the actual configuration by passing this -structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has -not been implemented in all drivers. +Any process can read the actual configuration by requesting tsconfig netlink +socket ``ETHTOOL_MSG_TSCONFIG_GET``. + +The legacy configuration is the use of the ioctl(SIOCSHWTSTAMP) with a pointer +to a struct ifreq whose ifr_data points to a struct hwtstamp_config. +The tx_type and rx_filter are hints to the driver what it is expected to do. +If the requested fine-grained filtering for incoming packets is not +supported, the driver may time stamp more than just the requested types +of packets. ioctl(SIOCGHWTSTAMP) is used in the same way as the +ioctl(SIOCSHWTSTAMP). However, this has not been implemented in all drivers. :: @@ -610,9 +619,10 @@ not been implemented in all drivers. 
-------------------------------------------------------- A driver which supports hardware time stamping must support the -SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with -the actual values as described in the section on SIOCSHWTSTAMP. It -should also support SIOCGHWTSTAMP. +ndo_hwtstamp_set NDO or the legacy SIOCSHWTSTAMP ioctl and update the +supplied struct hwtstamp_config with the actual values as described in +the section on SIOCSHWTSTAMP. It should also support ndo_hwtstamp_get or +the legacy SIOCGHWTSTAMP. Time stamps for received packets must be stored in the skb. To get a pointer to the shared time stamp structure of the skb call skb_hwtstamps(). Then diff --git a/Documentation/networking/tls.rst b/Documentation/networking/tls.rst index 658ed3a71e1b..c7904a1bc167 100644 --- a/Documentation/networking/tls.rst +++ b/Documentation/networking/tls.rst @@ -200,6 +200,32 @@ received without a cmsg buffer set. recv will never return data from mixed types of TLS records. +TLS 1.3 Key Updates +------------------- + +In TLS 1.3, KeyUpdate handshake messages signal that the sender is +updating its TX key. Any message sent after a KeyUpdate will be +encrypted using the new key. The userspace library can pass the new +key to the kernel using the TLS_TX and TLS_RX socket options, as for +the initial keys. TLS version and cipher cannot be changed. + +To prevent attempting to decrypt incoming records using the wrong key, +decryption will be paused when a KeyUpdate message is received by the +kernel, until the new key has been provided using the TLS_RX socket +option. Any read occurring after the KeyUpdate has been read and +before the new key is provided will fail with EKEYEXPIRED. poll() will +not report any read events from the socket until the new key is +provided. There is no pausing on the transmit side. + +Userspace should make sure that the crypto_info provided has been set +properly. 
In particular, the kernel will not check for key/nonce +reuse. + +The number of successful and failed key updates is tracked in the +``TlsTxRekeyOk``, ``TlsRxRekeyOk``, ``TlsTxRekeyError``, +``TlsRxRekeyError`` statistics. The ``TlsRxRekeyReceived`` statistic +counts KeyUpdate handshake messages that have been received. + Integrating in to userspace TLS library --------------------------------------- @@ -286,3 +312,13 @@ TLS implementation exposes the following per-namespace statistics - ``TlsRxNoPadViolation`` - number of data RX records which had to be re-decrypted due to ``TLS_RX_EXPECT_NO_PAD`` mis-prediction. + +- ``TlsTxRekeyOk``, ``TlsRxRekeyOk`` - + number of successful rekeys on existing sessions for TX and RX + +- ``TlsTxRekeyError``, ``TlsRxRekeyError`` - + number of failed rekeys on existing sessions for TX and RX + +- ``TlsRxRekeyReceived`` - + number of received KeyUpdate handshake messages, requiring userspace + to provide a new RX key diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst index bfea9d8579ed..66f6e9a9b59a 100644 --- a/Documentation/networking/xfrm_device.rst +++ b/Documentation/networking/xfrm_device.rst @@ -169,7 +169,8 @@ the stack in xfrm_input(). hand the packet to napi_gro_receive() as usual -In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn(). +In ESN mode, xdo_dev_state_advance_esn() is called from +xfrm_replay_advance_esn() for RX, and xfrm_replay_overflow_offload_esn for TX. Driver will check packet seq number and update HW ESN state machine if needed. Packet offload mode: |