diff options
Diffstat (limited to 'Documentation/networking')
30 files changed, 778 insertions, 168 deletions
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index f8f5766703d4..e700bf1d095c 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -193,6 +193,15 @@ ad_actor_sys_prio This parameter has effect only in 802.3ad mode and is available through SysFs interface. +actor_port_prio + + In an AD system, this specifies the port priority. The allowed range + is 1 - 65535. If the value is not specified, it takes 255 as the + default value. + + This parameter has effect only in 802.3ad mode and is available through + netlink interface. + ad_actor_system In an AD system, this specifies the mac-address for the actor in @@ -241,10 +250,18 @@ ad_select ports (slaves). Reselection occurs as described under the "bandwidth" setting, above. - The bandwidth and count selection policies permit failover of - 802.3ad aggregations when partial failure of the active aggregator - occurs. This keeps the aggregator with the highest availability - (either in bandwidth or in number of ports) active at all times. + actor_port_prio or 3 + + The active aggregator is chosen by the highest total sum of + actor port priorities across its active ports. Note this + priority is actor_port_prio, not per port prio, which is + used for primary reselect. + + The bandwidth, count and actor_port_prio selection policies permit + failover of 802.3ad aggregations when partial failure of the active + aggregator occurs. This keeps the aggregator with the highest + availability (either in bandwidth, number of ports, or total value + of port priorities) active at all times. This option was added in bonding version 3.4.0. @@ -582,10 +599,8 @@ miimon This determines how often the link state of each slave is inspected for link failures. A value of zero disables MII link monitoring. A value of 100 is a good starting point. - The use_carrier option, below, affects how the link state is - determined. See the High Availability section for additional - information. The default value is 100 if arp_interval is not - set. + + The default value is 100 if arp_interval is not set. min_links @@ -896,25 +911,14 @@ updelay use_carrier - Specifies whether or not miimon should use MII or ETHTOOL - ioctls vs. netif_carrier_ok() to determine the link - status. The MII or ETHTOOL ioctls are less efficient and - utilize a deprecated calling sequence within the kernel. The - netif_carrier_ok() relies on the device driver to maintain its - state with netif_carrier_on/off; at this writing, most, but - not all, device drivers support this facility. - - If bonding insists that the link is up when it should not be, - it may be that your network device driver does not support - netif_carrier_on/off. The default state for netif_carrier is - "carrier on," so if a driver does not support netif_carrier, - it will appear as if the link is always up. In this case, - setting use_carrier to 0 will cause bonding to revert to the - MII / ETHTOOL ioctl method to determine the link state. - - A value of 1 enables the use of netif_carrier_ok(), a value of - 0 will use the deprecated MII / ETHTOOL ioctls. The default - value is 1. + Obsolete option that previously selected between MII / + ETHTOOL ioctls and netif_carrier_ok() to determine link + state. + + All link state checks are now done with netif_carrier_ok(). + + For backwards compatibility, this option's value may be inspected + or set. The only valid setting is 1. xmit_hash_policy @@ -2036,22 +2040,8 @@ depending upon the device driver to maintain its carrier state, by querying the device's MII registers, or by making an ethtool query to the device. -If the use_carrier module parameter is 1 (the default value), -then the MII monitor will rely on the driver for carrier state -information (via the netif_carrier subsystem). As explained in the -use_carrier parameter information, above, if the MII monitor fails to -detect carrier loss on the device (e.g., when the cable is physically -disconnected), it may be that the driver does not support -netif_carrier. - -If use_carrier is 0, then the MII monitor will first query the -device's (via ioctl) MII registers and check the link state. If that -request fails (not just that it returns carrier down), then the MII -monitor will make an ethtool ETHTOOL_GLINK request to attempt to obtain -the same information. If both methods fail (i.e., the driver either -does not support or had some error in processing both the MII register -and ethtool requests), then the MII monitor will assume the link is -up. +The MII monitor relies on the driver for carrier state information (via +the netif_carrier subsystem). 8. Potential Sources of Trouble =============================== @@ -2135,34 +2125,6 @@ This will load tg3 and e1000 modules before loading the bonding one. Full documentation on this can be found in the modprobe.d and modprobe manual pages. -8.3. Painfully Slow Or No Failed Link Detection By Miimon ---------------------------------------------------------- - -By default, bonding enables the use_carrier option, which -instructs bonding to trust the driver to maintain carrier state. - -As discussed in the options section, above, some drivers do -not support the netif_carrier_on/_off link state tracking system. -With use_carrier enabled, bonding will always see these links as up, -regardless of their actual state. - -Additionally, other drivers do support netif_carrier, but do -not maintain it in real time, e.g., only polling the link state at -some fixed interval. In this case, miimon will detect failures, but -only after some long period of time has expired. If it appears that -miimon is very slow in detecting link failures, try specifying -use_carrier=0 to see if that improves the failure detection time. If -it does, then it may be that the driver checks the carrier state at a -fixed interval, but does not cache the MII register values (so the -use_carrier=0 method of querying the registers directly works). If -use_carrier=0 does not improve the failover, then the driver may cache -the registers, or the problem may be elsewhere. - -Also, remember that miimon only checks for the device's -carrier state. It has no way to determine the state of devices on or -beyond other ports of a switch, or if a switch is refusing to pass -traffic while still maintaining carrier on. - 9. SNMP agents =============== diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst index 7650c4b5be5f..f93049f03a37 100644 --- a/Documentation/networking/can.rst +++ b/Documentation/networking/can.rst @@ -539,7 +539,7 @@ CAN Filter Usage Optimisation The CAN filters are processed in per-device filter lists at CAN frame reception time. To reduce the number of checks that need to be performed while walking through the filter lists the CAN core provides an optimized -filter handling when the filter subscription focusses on a single CAN ID. +filter handling when the filter subscription focuses on a single CAN ID. For the possible 2048 SFF CAN identifiers the identifier is used as an index to access the corresponding subscription list without any further checks. diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst index 40ac552641a3..7cfcd183054f 100644 --- a/Documentation/networking/device_drivers/ethernet/index.rst +++ b/Documentation/networking/device_drivers/ethernet/index.rst @@ -50,6 +50,8 @@ Contents: neterion/s2io netronome/nfp pensando/ionic + pensando/ionic_rdma + qualcomm/ppe/ppe smsc/smc9 stmicro/stmmac ti/cpsw diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst index 754c81436408..cc498895f92e 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -1348,7 +1348,7 @@ Device Counters is in a congested state. If pci_bw_inbound_high == pci_bw_inbound_low then the device is not congested. If pci_bw_inbound_high > pci_bw_inbound_low then the device is congested. - - Tnformative + - Informative * - `pci_bw_inbound_low` - The number of times the device crossed the low inbound PCIe bandwidth @@ -1373,3 +1373,8 @@ Device Counters If pci_bw_outbound_high == pci_bw_outbound_low then the device is not congested. If pci_bw_outbound_high > pci_bw_outbound_low then the device is congested. - Informative + + * - `pci_bw_stale_event` + - The number of times the device fired a PCIe congestion event but on query + there was no change in state. + - Informative diff --git a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst index afb8353daefd..1e82f90d9ad2 100644 --- a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst +++ b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst @@ -69,6 +69,25 @@ On host boot the latest UEFI driver is always used, no explicit activation is required. Firmware activation is required to run new control firmware. cmrt firmware can only be activated by power cycling the NIC. +Health reporters +---------------- + +fw reporter +~~~~~~~~~~~ + +The ``fw`` health reporter tracks FW crashes. Dumping the reporter will +show the core dump of the most recent FW crash, and if no FW crash has +happened since power cycle - a snapshot of the FW memory. Diagnose callback +shows FW uptime based on the most recently received heartbeat message +(the crashes are detected by checking if uptime goes down). + +otp reporter +~~~~~~~~~~~~ + +OTP memory ("fuses") are used for secure boot and anti-rollback +protection. The OTP memory is ECC protected, ECC errors indicate +either manufacturing defect or part deteriorating with age. + Statistics ---------- @@ -160,3 +179,14 @@ behavior and potential performance bottlenecks. credit exhaustion - ``pcie_ob_rd_no_np_cred``: Read requests dropped due to non-posted credit exhaustion + +XDP Length Error: +~~~~~~~~~~~~~~~~~ + +For XDP programs without frags support, fbnic tries to make sure that MTU fits +into a single buffer. If an oversized frame is received and gets fragmented, +it is dropped and the following netlink counters are updated + + - ``rx-length``: number of frames dropped due to lack of fragmentation + support in the attached XDP program + - ``rx-errors``: total number of packets with errors received on the interface diff --git a/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst index 05fe2b11bb18..a0029b6db31e 100644 --- a/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst +++ b/Documentation/networking/device_drivers/ethernet/pensando/ionic.rst @@ -13,6 +13,7 @@ Contents - Identifying the Adapter - Enabling the driver - Configuring the driver +- RDMA Support via Auxiliary Device - Statistics - Support @@ -105,6 +106,15 @@ XDP Support for XDP includes the basics, plus Jumbo frames, Redirect and ndo_xmit. There is no current support for zero-copy sockets or HW offload. +RDMA Support via Auxiliary Device +================================= + +The ionic driver supports RDMA (Remote Direct Memory Access) functionality +through the Linux auxiliary device framework when advertised by the firmware. +RDMA capability is detected during device initialization, and if supported, +the ethernet driver will create an auxiliary device that allows the RDMA +driver to bind and provide InfiniBand/RoCE functionality. + Statistics ========== diff --git a/Documentation/networking/device_drivers/ethernet/pensando/ionic_rdma.rst b/Documentation/networking/device_drivers/ethernet/pensando/ionic_rdma.rst new file mode 100644 index 000000000000..42eb461d5f85 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/pensando/ionic_rdma.rst @@ -0,0 +1,52 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +=========================================================== +RDMA Driver for the AMD Pensando(R) Ethernet adapter family +=========================================================== + +AMD Pensando RDMA driver. +Copyright (C) 2018-2025, Advanced Micro Devices, Inc. + +Overview +======== + +The ionic_rdma driver provides Remote Direct Memory Access functionality +for AMD Pensando DSC (Distributed Services Card) devices. This driver +implements RDMA capabilities as an auxiliary driver that operates in +conjunction with the ionic ethernet driver. + +The ionic ethernet driver detects RDMA capability during device +initialization and creates auxiliary devices that the ionic_rdma driver +binds to, establishing the RDMA data path and control interfaces. + +Identifying the Adapter +======================= + +See Documentation/networking/device_drivers/ethernet/pensando/ionic.rst +for more information on identifying the adapter. + +Enabling the driver +=================== + +The ionic_rdma driver depends on the ionic ethernet driver. +See Documentation/networking/device_drivers/ethernet/pensando/ionic.rst +for detailed information on enabling and configuring the ionic driver. + +The ionic_rdma driver is enabled via the standard kernel configuration system, +using the make command:: + + make oldconfig/menuconfig/etc. + +The driver is located in the menu structure at: + + -> Device Drivers + -> InfiniBand support + -> AMD Pensando DSC RDMA/RoCE Support + +Support +======= + +For general Linux RDMA support, please use the RDMA mailing +list, which is monitored by AMD Pensando personnel:: + + linux-rdma@vger.kernel.org diff --git a/Documentation/networking/device_drivers/ethernet/qualcomm/ppe/ppe.rst b/Documentation/networking/device_drivers/ethernet/qualcomm/ppe/ppe.rst new file mode 100644 index 000000000000..4ab299a28969 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/qualcomm/ppe/ppe.rst @@ -0,0 +1,194 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============================================== +PPE Ethernet Driver for Qualcomm IPQ SoC Family +=============================================== + +Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries. + +Author: Lei Wei <quic_leiwei@quicinc.com> + + +Contents +======== + +- `PPE Overview`_ +- `PPE Driver Overview`_ +- `PPE Driver Supported SoCs`_ +- `Enabling the Driver`_ +- `Debugging`_ + + +PPE Overview +============ + +IPQ (Qualcomm Internet Processor) SoC (System-on-Chip) series is Qualcomm's series of +networking SoC for Wi-Fi access points. The PPE (Packet Process Engine) is the Ethernet +packet process engine in the IPQ SoC. + +Below is a simplified hardware diagram of IPQ9574 SoC which includes the PPE engine and +other blocks which are in the SoC but outside the PPE engine. These blocks work together +to enable the Ethernet for the IPQ SoC:: + + +------+ +------+ +------+ +------+ +------+ +------+ start +-------+ + |netdev| |netdev| |netdev| |netdev| |netdev| |netdev|<------|PHYLINK| + +------+ +------+ +------+ +------+ +------+ +------+ stop +-+-+-+-+ + | | | ^ + +-------+ +-------------------------+--------+----------------------+ | | | + | GCC | | | EDMA | | | | | + +---+---+ | PPE +---+----+ | | | | + | clk | | | | | | + +-------->| +-----------------------+------+-----+---------------+ | | | | + | | Switch Core |Port0 | |Port7(EIP FIFO)| | | | | + | | +---+--+ +------+--------+ | | | | + | | | | | | | | | + +-------+ | | +------+---------------+----+ | | | | | + |CMN PLL| | | +---+ +---+ +----+ | +--------+ | | | | | | + +---+---+ | | |BM | |QM | |SCH | | | L2/L3 | ....... | | | | | | + | | | | +---+ +---+ +----+ | +--------+ | | | | | | + | | | | +------+--------------------+ | | | | | + | | | | | | | | | | + | v | | +-----+-+-----+-+-----+-+-+---+--+-----+-+-----+ | | | | | + | +------+ | | |Port1| |Port2| |Port3| |Port4| |Port5| |Port6| | | | | | + | |NSSCC | | | +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ | | mac| | | + | +-+-+--+ | | |MAC0 | |MAC1 | |MAC2 | |MAC3 | |MAC4 | |MAC5 | | |<---+ | | + | ^ | |clk | | +-----+-+-----+-+-----+-+-----+--+-----+-+-----+ | | ops | | + | | | +------>| +----|------|-------|-------|---------|--------|-----+ | | | + | | | +---------------------------------------------------------+ | | + | | | | | | | | | | | + | | | MII clk | QSGMII USXGMII USXGMII | | + | | +--------------->| | | | | | | | + | | +-------------------------+ +---------+ +---------+ | | + | |125/312.5MHz clk| (PCS0) | | (PCS1) | | (PCS2) | pcs ops | | + | +----------------+ UNIPHY0 | | UNIPHY1 | | UNIPHY2 |<--------+ | + +----------------->| | | | | | | + | 31.25MHz ref clk +-------------------------+ +---------+ +---------+ | + | | | | | | | | + | +-----------------------------------------------------+ | + |25/50MHz ref clk| +-------------------------+ +------+ +------+ | link | + +--------------->| | QUAD PHY | | PHY4 | | PHY5 | |---------+ + | +-------------------------+ +------+ +------+ | change + | | + | MDIO bus | + +-----------------------------------------------------+ + +The CMN (Common) PLL, NSSCC (Networking Sub System Clock Controller) and GCC (Global +Clock Controller) blocks are in the SoC and act as clock providers. + +The UNIPHY block is in the SoC and provides the PCS (Physical Coding Sublayer) and +XPCS (10-Gigabit Physical Coding Sublayer) functions to support different interface +modes between the PPE MAC and the external PHY. + +This documentation focuses on the descriptions of PPE engine and the PPE driver. + +The Ethernet functionality in the PPE (Packet Process Engine) is comprised of three +components: the switch core, port wrapper and Ethernet DMA. + +The Switch core in the IPQ9574 PPE has maximum of 6 front panel ports and two FIFO +interfaces. One of the two FIFO interfaces is used for Ethernet port to host CPU +communication using Ethernet DMA. The other one is used to communicate to the EIP +engine which is used for IPsec offload. On the IPQ9574, the PPE includes 6 GMAC/XGMACs +that can be connected with external Ethernet PHY. Switch core also includes BM (Buffer +Management), QM (Queue Management) and SCH (Scheduler) modules for supporting the +packet processing. + +The port wrapper provides connections from the 6 GMAC/XGMACS to UNIPHY (PCS) supporting +various modes such as SGMII/QSGMII/PSGMII/USXGMII/10G-BASER. There are 3 UNIPHY (PCS) +instances supported on the IPQ9574. + +Ethernet DMA is used to transmit and receive packets between the Ethernet subsystem +and ARM host CPU. + +The following lists the main blocks in the PPE engine which will be driven by this +PPE driver: + +- BM + BM is the hardware buffer manager for the PPE switch ports. +- QM + Queue Manager for managing the egress hardware queues of the PPE switch ports. +- SCH + The scheduler which manages the hardware traffic scheduling for the PPE switch ports. +- L2 + The L2 block performs the packet bridging in the switch core. The bridge domain is + represented by the VSI (Virtual Switch Instance) domain in PPE. FDB learning can be + enabled based on the VSI domain and bridge forwarding occurs within the VSI domain. +- MAC + The PPE in the IPQ9574 supports up to six MACs (MAC0 to MAC5) which are corresponding + to six switch ports (port1 to port6). The MAC block is connected with external PHY + through the UNIPHY PCS block. Each MAC block includes the GMAC and XGMAC blocks and + the switch port can select to use GMAC or XMAC through a MUX selection according to + the external PHY's capability. +- EDMA (Ethernet DMA) + The Ethernet DMA is used to transmit and receive Ethernet packets between the PPE + ports and the ARM cores. + +The received packet on a PPE MAC port can be forwarded to another PPE MAC port. It can +be also forwarded to internal switch port0 so that the packet can be delivered to the +ARM cores using the Ethernet DMA (EDMA) engine. The Ethernet DMA driver will deliver the +packet to the corresponding 'netdevice' interface. + +The software instantiations of the PPE MAC (netdevice), PCS and external PHYs interact +with the Linux PHYLINK framework to manage the connectivity between the PPE ports and +the connected PHYs, and the port link states. This is also illustrated in above diagram. + + +PPE Driver Overview +=================== +PPE driver is Ethernet driver for the Qualcomm IPQ SoC. It is a single platform driver +which includes the PPE part and Ethernet DMA part. The PPE part initializes and drives the +various blocks in PPE switch core such as BM/QM/L2 blocks and the PPE MACs. The EDMA part +drives the Ethernet DMA for packet transfer between PPE ports and ARM cores, and enables +the netdevice driver for the PPE ports. + +The PPE driver files in drivers/net/ethernet/qualcomm/ppe/ are listed as below: + +- Makefile +- ppe.c +- ppe.h +- ppe_config.c +- ppe_config.h +- ppe_debugfs.c +- ppe_debugfs.h +- ppe_regs.h + +The ppe.c file contains the main PPE platform driver and undertakes the initialization of +PPE switch core blocks such as QM, BM and L2. The configuration APIs for these hardware +blocks are provided in the ppe_config.c file. + +The ppe.h defines the PPE device data structure which will be used by PPE driver functions. + +The ppe_debugfs.c enables the PPE statistics counters such as PPE port Rx and Tx counters, +CPU code counters and queue counters. + + +PPE Driver Supported SoCs +========================= + +The PPE driver supports the following IPQ SoC: + +- IPQ9574 + + +Enabling the Driver +=================== + +The driver is located in the menu structure at:: + + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet driver support + -> Qualcomm devices + -> Qualcomm Technologies, Inc. PPE Ethernet support + +If the driver is built as a module, the module will be called qcom-ppe. + +The PPE driver functionally depends on the CMN PLL and NSSCC clock controller drivers. +Please make sure the dependent modules are installed before installing the PPE driver +module. + + +Debugging +========= + +The PPE hardware counters can be accessed using debugfs interface from the +``/sys/kernel/debug/ppe/`` directory. diff --git a/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst index 25fd9aa284e2..f0424597aac1 100644 --- a/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst +++ b/Documentation/networking/device_drivers/ethernet/ti/am65_nuss_cpsw_switchdev.rst @@ -42,7 +42,7 @@ Port's netdev devices have to be in UP before joining to the bridge to avoid overwriting of bridge configuration as CPSW switch driver completely reloads its configuration when first port changes its state to UP. -When the both interfaces joined the bridge - CPSW switch driver will enable +When both interfaces have joined the bridge - CPSW switch driver will enable marking packets with offload_fwd_mark flag. All configuration is implemented via switchdev API. diff --git a/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst index 464dce938ed1..2f3c43a32bfc 100644 --- a/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst +++ b/Documentation/networking/device_drivers/ethernet/ti/cpsw_switchdev.rst @@ -92,7 +92,7 @@ Port's netdev devices have to be in UP before joining to the bridge to avoid overwriting of bridge configuration as CPSW switch driver copletly reloads its configuration when first Port changes its state to UP. -When the both interfaces joined the bridge - CPSW switch driver will enable +When both interfaces have joined the bridge - CPSW switch driver will enable marking packets with offload_fwd_mark flag unless "ale_bypass=0" All configuration is implemented via switchdev API. diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst index e0b8cfed610a..4d10536377ab 100644 --- a/Documentation/networking/devlink/devlink-health.rst +++ b/Documentation/networking/devlink/devlink-health.rst @@ -50,7 +50,7 @@ Once an error is reported, devlink health will perform the following actions: * Auto recovery attempt is being done. Depends on: - Auto-recovery configuration - - Grace period vs. time passed since last recover + - Grace period (and burst period) vs. time passed since last recover Devlink formatted message ========================= diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst index 211b58177e12..0a9c20d70122 100644 --- a/Documentation/networking/devlink/devlink-params.rst +++ b/Documentation/networking/devlink/devlink-params.rst @@ -143,3 +143,11 @@ own name. * - ``clock_id`` - u64 - Clock ID used by the device for registering DPLL devices and pins. + * - ``total_vfs`` + - u32 + - The max number of Virtual Functions (VFs) exposed by the PF. + after reboot/pci reset, 'sriov_totalvfs' entry under the device's sysfs + directory will report this value. + * - ``num_doorbells`` + - u32 + - Controls the number of doorbells used by the device. diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index 270a65a01411..0c58e5c729d9 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -56,18 +56,18 @@ general. :maxdepth: 1 devlink-dpipe + devlink-eswitch-attr + devlink-flash devlink-health devlink-info - devlink-flash + devlink-linecard devlink-params devlink-port devlink-region - devlink-resource devlink-reload + devlink-resource devlink-selftests devlink-trap - devlink-linecard - devlink-eswitch-attr Driver-specific documentation ----------------------------- @@ -78,12 +78,14 @@ parameters, info versions, and other features it supports. .. toctree:: :maxdepth: 1 + am65-nuss-cpsw-switch bnxt etas_es58x hns3 i40e - ionic ice + ionic + iosm ixgbe kvaser_pciefd kvaser_usb @@ -93,11 +95,9 @@ parameters, info versions, and other features it supports. mv88e6xxx netdevsim nfp - qed - ti-cpsw-switch - am65-nuss-cpsw-switch - prestera - iosm octeontx2 + prestera + qed sfc + ti-cpsw-switch zl3073x diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 7febe0aecd53..0e5f9c76e514 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -15,23 +15,62 @@ Parameters * - Name - Mode - Validation + - Notes * - ``enable_roce`` - driverinit - - Type: Boolean - - If the device supports RoCE disablement, RoCE enablement state controls + - Boolean + - If the device supports RoCE disablement, RoCE enablement state controls device support for RoCE capability. Otherwise, the control occurs in the driver stack. When RoCE is disabled at the driver level, only raw ethernet QPs are supported. * - ``io_eq_size`` - driverinit - The range is between 64 and 4096. + - * - ``event_eq_size`` - driverinit - The range is between 64 and 4096. + - * - ``max_macs`` - driverinit - The range is between 1 and 2^31. Only power of 2 values are supported. + - + * - ``enable_sriov`` + - permanent + - Boolean + - Applies to each physical function (PF) independently, if the device + supports it. Otherwise, it applies symmetrically to all PFs. + * - ``total_vfs`` + - permanent + - The range is between 1 and a device-specific max. + - Applies to each physical function (PF) independently, if the device + supports it. Otherwise, it applies symmetrically to all PFs. + +Note: permanent parameters such as ``enable_sriov`` and ``total_vfs`` require FW reset to take effect + +.. code-block:: bash + + # setup parameters + devlink dev param set pci/0000:01:00.0 name enable_sriov value true cmode permanent + devlink dev param set pci/0000:01:00.0 name total_vfs value 8 cmode permanent + + # Fw reset + devlink dev reload pci/0000:01:00.0 action fw_activate + + # for PCI related config such as sriov PCI reset/rescan is required: + echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove + echo 1 >/sys/bus/pci/rescan + grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_* + + * - ``num_doorbells`` + - driverinit + - This controls the number of channel doorbells used by the netdev. In all + cases, an additional doorbell is allocated and used for non-channel + communication (e.g. for PTP, HWS, etc.). Supported values are: + + - 0: No channel-specific doorbells, use the global one for everything. + - [1, max_num_channels]: Spread netdev channels equally across these + doorbells. The ``mlx5`` driver also implements the following driver-specific parameters. @@ -116,6 +155,68 @@ parameters. - u32 - driverinit - Control the size (in packets) of the hairpin queues. + * - ``pcie_cong_inbound_high`` + - u16 + - driverinit + - High threshold configuration for PCIe congestion events. The firmware + will send an event once device side inbound PCIe traffic went + above the configured high threshold for a long enough period (at least + 200ms). + + See pci_bw_inbound_high ethtool stat. + + Units are 0.01 %. Accepted values are in range [0, 10000]. + pcie_cong_inbound_low < pcie_cong_inbound_high. + Default value: 9000 (Corresponds to 90%). + * - ``pcie_cong_inbound_low`` + - u16 + - driverinit + - Low threshold configuration for PCIe congestion events. The firmware + will send an event once device side inbound PCIe traffic went + below the configured low threshold, only after having been previously in + a congested state. + + See pci_bw_inbound_low ethtool stat. + + Units are 0.01 %. Accepted values are in range [0, 10000]. + pcie_cong_inbound_low < pcie_cong_inbound_high. + Default value: 7500. + * - ``pcie_cong_outbound_high`` + - u16 + - driverinit + - High threshold configuration for PCIe congestion events. The firmware + will send an event once device side outbound PCIe traffic went + above the configured high threshold for a long enough period (at least + 200ms). + + See pci_bw_outbound_high ethtool stat. + + Units are 0.01 %. Accepted values are in range [0, 10000]. + pcie_cong_outbound_low < pcie_cong_outbound_high. + Default value: 9000 (Corresponds to 90%). + * - ``pcie_cong_outbound_low`` + - u16 + - driverinit + - Low threshold configuration for PCIe congestion events. The firmware + will send an event once device side outbound PCIe traffic went + below the configured low threshold, only after having been previously in + a congested state. + + See pci_bw_outbound_low ethtool stat. + + Units are 0.01 %. Accepted values are in range [0, 10000]. + pcie_cong_outbound_low < pcie_cong_outbound_high. + Default value: 7500. + + * - ``cqe_compress_type`` + - string + - permanent + - Configure which mechanism/algorithm should be used by the NIC that will + affect the rate (aggressiveness) of compressed CQEs depending on PCIe bus + conditions and other internal NIC factors. This mode affects all queues + that enable compression. + * ``balanced`` : Merges fewer CQEs, resulting in a moderate compression ratio but maintaining a balance between bandwidth savings and performance + * ``aggressive`` : Merges more CQEs into a single entry, achieving a higher compression rate and maximizing performance, particularly under high traffic loads The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` @@ -284,6 +385,12 @@ Description of the vnic counters: amount of Interconnect Host Memory (ICM) consumed by the vnic in granularity of 4KB. ICM is host memory allocated by SW upon HCA request and is used for storing data structures that control HCA operation. +- bar_uar_access + number of WRITE or READ access operations to the UAR on the PCIe BAR. +- odp_local_triggered_page_fault + number of locally-triggered page-faults due to ODP. +- odp_remote_triggered_page_fault + number of remotly-triggered page-faults due to ODP. User commands examples: diff --git a/Documentation/networking/devlink/zl3073x.rst b/Documentation/networking/devlink/zl3073x.rst index 4b6cfaf38643..fc5a8dc272a7 100644 --- a/Documentation/networking/devlink/zl3073x.rst +++ b/Documentation/networking/devlink/zl3073x.rst @@ -49,3 +49,17 @@ The ``zl3073x`` driver reports the following versions - running - 1.3.0.1 - Device configuration version customized by OEM + +Flash Update +============ + +The ``zl3073x`` driver implements support for flash update using the +``devlink-flash`` interface. It supports updating the device flash using a +combined flash image ("bundle") that contains multiple components (firmware +parts and configurations). + +During the flash procedure, the standard firmware interface is not available, +so the driver unregisters all DPLLs and associated pins, and re-registers them +once the flash procedure is complete. + +The driver does not support any overwrite mask flags. diff --git a/Documentation/networking/dns_resolver.rst b/Documentation/networking/dns_resolver.rst index c0364f7070af..52f298834db6 100644 --- a/Documentation/networking/dns_resolver.rst +++ b/Documentation/networking/dns_resolver.rst @@ -25,11 +25,11 @@ These routines must be supported by userspace tools dns.upcall, cifs.upcall and request-key. It is under development and does not yet provide the full feature set. The features it does support include: - (*) Implements the dns_resolver key_type to contact userspace. + * Implements the dns_resolver key_type to contact userspace. It does not yet support the following AFS features: - (*) Dns query support for AFSDB resource record. + * DNS query support for AFSDB resource record. This code is extracted from the CIFS filesystem. @@ -64,44 +64,42 @@ before the more general line given above as the first match is the one taken:: Usage ===== -To make use of this facility, one of the following functions that are -implemented in the module can be called after doing:: +To make use of this facility, first ``dns_resolver.h`` must be included:: #include <linux/dns_resolver.h> - :: +Then queries may be made by calling:: int dns_query(const char *type, const char *name, size_t namelen, const char *options, char **_result, time_t *_expiry); - This is the basic access function. It looks for a cached DNS query and if - it doesn't find it, it upcalls to userspace to make a new DNS query, which - may then be cached. The key description is constructed as a string of the - form:: +This is the basic access function. It looks for a cached DNS query and if +it doesn't find it, it upcalls to userspace to make a new DNS query, which +may then be cached. The key description is constructed as a string of the +form:: [<type>:]<name> - where <type> optionally specifies the particular upcall program to invoke, - and thus the type of query to do, and <name> specifies the string to be - looked up. The default query type is a straight hostname to IP address - set lookup. +where <type> optionally specifies the particular upcall program to invoke, +and thus the type of query, and <name> specifies the string to be looked up. +The default query type is a straight hostname to IP address set lookup. - The name parameter is not required to be a NUL-terminated string, and its - length should be given by the namelen argument. +The name parameter is not required to be a NUL-terminated string, and its +length should be given by the namelen argument. - The options parameter may be NULL or it may be a set of options - appropriate to the query type. +The options parameter may be NULL or it may be a set of options +appropriate to the query type. - The return value is a string appropriate to the query type. For instance, - for the default query type it is just a list of comma-separated IPv4 and - IPv6 addresses. The caller must free the result. +The return value is a string appropriate to the query type. For instance, +for the default query type it is just a list of comma-separated IPv4 and +IPv6 addresses. The caller must free the result. - The length of the result string is returned on success, and a negative - error code is returned otherwise. -EKEYREJECTED will be returned if the - DNS lookup failed. +The length of the result string is returned on success, and a negative +error code is returned otherwise. -EKEYREJECTED will be returned if the +DNS lookup failed. - If _expiry is non-NULL, the expiry time (TTL) of the result will be - returned also. +If _expiry is non-NULL, the expiry time (TTL) of the result will be +returned also. The kernel maintains an internal keyring in which it caches looked up keys. This can be cleared by any process that has the CAP_SYS_ADMIN capability by @@ -142,8 +140,8 @@ the key will be discarded and recreated when the data it holds has expired. dns_query() returns a copy of the value attached to the key, or an error if that is indicated instead. -See <file:Documentation/security/keys/request-key.rst> for further -information about request-key function. +See Documentation/security/keys/request-key.rst for further information about +request-key function. Debugging diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index ab20c644af24..b270886c5f5d 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -1541,6 +1541,11 @@ Drivers fill in the statistics in the following structure: .. kernel-doc:: include/linux/ethtool.h :identifiers: ethtool_fec_stats +Statistics may have FEC bins histogram attribute ``ETHTOOL_A_FEC_STAT_HIST`` +as defined in IEEE 802.3ck-2022 and 802.3df-2024. Nested attributes will have +the range of FEC errors in the bin (inclusive) and the amount of error events +in the bin. + FEC_SET ======= diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index ac90b82f3ce9..c775cababc8c 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -57,7 +57,7 @@ Contents: filter generic-hdlc generic_netlink - netlink_spec/index + ../netlink/specs/index gen_stats gtp ila @@ -101,6 +101,7 @@ Contents: ppp_generic proc_net_tcp pse-pd/index + psp radiotap-headers rds regulatory diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst index 0127319b30bb..54a72e172bdc 100644 --- a/Documentation/networking/iou-zcrx.rst +++ b/Documentation/networking/iou-zcrx.rst @@ -75,7 +75,7 @@ Create an io_uring instance with the following required setup flags:: IORING_SETUP_SINGLE_ISSUER IORING_SETUP_DEFER_TASKRUN - IORING_SETUP_CQE32 + IORING_SETUP_CQE32 or IORING_SETUP_CQE_MIXED Create memory area ------------------ diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 9756d16e3df1..a06cb99d66dc 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -209,7 +209,7 @@ neigh/default/unres_qlen_bytes - INTEGER Setting negative value is meaningless and will return error. - Default: SK_WMEM_MAX, (same as net.core.wmem_default). + Default: SK_WMEM_DEFAULT, (same as net.core.wmem_default). Exact value depends on architecture and kernel options, but should be enough to allow queuing 256 packets @@ -443,23 +443,56 @@ tcp_early_retrans - INTEGER tcp_ecn - INTEGER Control use of Explicit Congestion Notification (ECN) by TCP. - ECN is used only when both ends of the TCP connection indicate - support for it. This feature is useful in avoiding losses due - to congestion by allowing supporting routers to signal - congestion before having to drop packets. + ECN is used only when both ends of the TCP connection indicate support + for it. This feature is useful in avoiding losses due to congestion by + allowing supporting routers to signal congestion before having to drop + packets. A host that supports ECN both sends ECN at the IP layer and + feeds back ECN at the TCP layer. The highest variant of ECN feedback + that both peers support is chosen by the ECN negotiation (Accurate ECN, + ECN, or no ECN). + + The highest negotiated variant for incoming connection requests + and the highest variant requested by outgoing connection + attempts: + + ===== ==================== ==================== + Value Incoming connections Outgoing connections + ===== ==================== ==================== + 0 No ECN No ECN + 1 ECN ECN + 2 ECN No ECN + 3 AccECN AccECN + 4 AccECN ECN + 5 AccECN No ECN + ===== ==================== ==================== + + Default: 2 + +tcp_ecn_option - INTEGER + Control Accurate ECN (AccECN) option sending when AccECN has been + successfully negotiated during handshake. Send logic inhibits + sending AccECN options regarless of this setting when no AccECN + option has been seen for the reverse direction. Possible values are: - = ===================================================== - 0 Disable ECN. Neither initiate nor accept ECN. - 1 Enable ECN when requested by incoming connections and - also request ECN on outgoing connection attempts. - 2 Enable ECN when requested by incoming connections - but do not request ECN on outgoing connections. - = ===================================================== + = ============================================================ + 0 Never send AccECN option. This also disables sending AccECN + option in SYN/ACK during handshake. + 1 Send AccECN option sparingly according to the minimum option + rules outlined in draft-ietf-tcpm-accurate-ecn. + 2 Send AccECN option on every packet whenever it fits into TCP + option space. + = ============================================================ Default: 2 +tcp_ecn_option_beacon - INTEGER + Control Accurate ECN (AccECN) option sending frequency per RTT and it + takes effect only when tcp_ecn_option is set to 2. + + Default: 3 (AccECN will be send at least 3 times per RTT) + tcp_ecn_fallback - BOOLEAN If the kernel detects that ECN connection misbehaves, enable fall back to non-ECN. Currently, this knob implements the fallback @@ -805,8 +838,8 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max This value results in initial window of 65535. max: maximal size of receive buffer allowed for automatically - selected receiver buffers for TCP socket. This value does not override - net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables + selected receiver buffers for TCP socket. + Calling setsockopt() with SO_RCVBUF disables automatic tuning of that socket's receive buffer size, in which case this value is ignored. Default: between 131072 and 32MB, depending on RAM size. @@ -3508,16 +3541,10 @@ cookie_hmac_alg - STRING a listening sctp socket to a connecting client in the INIT-ACK chunk. Valid values are: - * md5 - * sha1 + * sha256 * none - Ability to assign md5 or sha1 as the selected alg is predicated on the - configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and - CONFIG_CRYPTO_SHA1). - - Default: Dependent on configuration. MD5 if available, else SHA1 if - available, else none. + Default: sha256 rcvbuf_policy - INTEGER Determines if the receive buffer is attributed to the socket or to diff --git a/Documentation/networking/mptcp-sysctl.rst b/Documentation/networking/mptcp-sysctl.rst index 1683c139821e..1eb6af26b4a7 100644 --- a/Documentation/networking/mptcp-sysctl.rst +++ b/Documentation/networking/mptcp-sysctl.rst @@ -8,9 +8,11 @@ MPTCP Sysfs variables =============================== add_addr_timeout - INTEGER (seconds) - Set the timeout after which an ADD_ADDR control message will be - resent to an MPTCP peer that has not acknowledged a previous - ADD_ADDR message. + Set the maximum value of timeout after which an ADD_ADDR control message + will be resent to an MPTCP peer that has not acknowledged a previous + ADD_ADDR message. A dynamically estimated retransmission timeout based + on the estimated connection round-trip-time is used if this value is + lower than the maximum one. Do not retransmit if set to 0. diff --git a/Documentation/networking/mptcp.rst b/Documentation/networking/mptcp.rst index 2e31038d6462..b6753ffb9c9a 100644 --- a/Documentation/networking/mptcp.rst +++ b/Documentation/networking/mptcp.rst @@ -66,7 +66,7 @@ the same rules are applied for all the connections (see: ``ip mptcp``) ; and the userspace one (``userspace``), controlled by a userspace daemon (i.e. `mptcpd <https://mptcpd.mptcp.dev/>`_) where different rules can be applied for each connection. The path managers can be controlled via a Netlink API; see -netlink_spec/mptcp_pm.rst. +../netlink/specs/mptcp_pm.rst. To be able to use multiple IP addresses on a host to create multiple *subflows* (paths), the default in-kernel MPTCP path-manager needs to know which IP diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst index 7bbda5944ee2..26f32dbcf6ec 100644 --- a/Documentation/networking/net_cachelines/tcp_sock.rst +++ b/Documentation/networking/net_cachelines/tcp_sock.rst @@ -26,8 +26,8 @@ u64 bytes_acked read_w u32 dsack_dups u32 snd_una read_mostly read_write tcp_wnd_end,tcp_urg_mode,tcp_minshall_check,tcp_cwnd_validate(tx);tcp_ack,tcp_may_update_window,tcp_clean_rtx_queue(write),tcp_ack_tstamp(rx) u32 snd_sml read_write tcp_minshall_check,tcp_minshall_update -u32 rcv_tstamp read_mostly tcp_ack -void * tcp_clean_acked read_mostly tcp_ack +u32 rcv_tstamp read_write read_write tcp_ack +void * tcp_clean_acked read_mostly tcp_ack u32 lsndtime read_write tcp_slow_start_after_idle_check,tcp_event_data_sent u32 last_oow_ack_time u32 compressed_ack_rcv_nxt @@ -57,7 +57,7 @@ u8:1 is_sack_reneg read_m u8:2 fastopen_client_fail u8:4 nonagle read_write tcp_skb_entail,tcp_push_pending_frames u8:1 thin_lto -u8:1 recvmsg_inq +u8:1 recvmsg_inq read_mostly tcp_recvmsg u8:1 repair read_mostly tcp_write_xmit u8:1 frto u8 repair_queue @@ -101,6 +101,18 @@ u32 prr_delivered u32 prr_out read_mostly read_mostly tcp_rate_skb_sent,tcp_newly_delivered(tx);tcp_ack,tcp_rate_gen,tcp_clean_rtx_queue(rx) u32 delivered read_mostly read_write tcp_rate_skb_sent, tcp_newly_delivered(tx);tcp_ack, tcp_rate_gen, tcp_clean_rtx_queue (rx) u32 delivered_ce read_mostly read_write tcp_rate_skb_sent(tx);tcp_rate_gen(rx) +u32 received_ce read_mostly read_write +u32[3] received_ecn_bytes read_mostly read_write +u8:4 received_ce_pending read_mostly read_write +u32[3] delivered_ecn_bytes read_write +u8:2 syn_ect_snt write_mostly read_write +u8:2 syn_ect_rcv read_mostly read_write +u8:2 accecn_minlen write_mostly read_write +u8:2 est_ecnfield read_write +u8:2 accecn_opt_demand read_mostly read_write +u8:2 prev_ecnfield read_write +u64 accecn_opt_tstamp read_write +u8:4 accecn_fail_mode u32 lost read_mostly tcp_ack u32 app_limited read_write read_mostly tcp_rate_check_app_limited,tcp_rate_skb_sent(tx);tcp_rate_gen(rx) u64 first_tx_mstamp read_write tcp_rate_skb_sent diff --git a/Documentation/networking/netlink_spec/.gitignore b/Documentation/networking/netlink_spec/.gitignore deleted file mode 100644 index 30d85567b592..000000000000 --- a/Documentation/networking/netlink_spec/.gitignore +++ /dev/null @@ -1 +0,0 @@ -*.rst diff --git a/Documentation/networking/netlink_spec/readme.txt b/Documentation/networking/netlink_spec/readme.txt deleted file mode 100644 index 030b44aca4e6..000000000000 --- a/Documentation/networking/netlink_spec/readme.txt +++ /dev/null @@ -1,4 +0,0 @@ -SPDX-License-Identifier: GPL-2.0 - -This file is populated during the build of the documentation (htmldocs) by the -tools/net/ynl/pyynl/ynl_gen_rst.py script. diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index 7f159043ad5a..b0f2ef83735d 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -20,7 +20,7 @@ sometimes quite different) ethernet controllers connected to the same management bus, it is difficult to ensure safe use of the bus. Since the PHYs are devices, and the management busses through which they are -accessed are, in fact, busses, the PHY Abstraction Layer treats them as such. +accessed are, in fact, busses, the PHY Abstraction Layer (PAL) treats them as such. In doing so, it has these goals: #. Increase code-reuse diff --git a/Documentation/networking/psp.rst b/Documentation/networking/psp.rst new file mode 100644 index 000000000000..4ac09e64e95a --- /dev/null +++ b/Documentation/networking/psp.rst @@ -0,0 +1,183 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +===================== +PSP Security Protocol +===================== + +Protocol +======== + +PSP Security Protocol (PSP) was defined at Google and published in: + +https://raw.githubusercontent.com/google/psp/main/doc/PSP_Arch_Spec.pdf + +This section briefly covers protocol aspects crucial for understanding +the kernel API. Refer to the protocol specification for further details. + +Note that the kernel implementation and documentation uses the term +"device key" in place of "master key", it is both less confusing +to an average developer and is less likely to run afoul any naming +guidelines. + +Derived Rx keys +--------------- + +PSP borrows some terms and mechanisms from IPsec. PSP was designed +with HW offloads in mind. The key feature of PSP is that Rx keys for every +connection do not have to be stored by the receiver but can be derived +from device key and information present in packet headers. +This makes it possible to implement receivers which require a constant +amount of memory regardless of the number of connections (``O(1)`` scaling). + +Tx keys have to be stored like with any other protocol, but Tx is much +less latency sensitive than Rx, and delays in fetching keys from slow +memory is less likely to cause packet drops. Preferably, the Tx keys +should be provided with the packet (e.g. as part of the descriptors). + +Key rotation +------------ + +The device key known only to the receiver is fundamental to the design. +Per specification this state cannot be directly accessible (it must be +impossible to read it out of the hardware of the receiver NIC). +Moreover, it has to be "rotated" periodically (usually daily). Rotation +means that new device key gets generated (by a random number generator +of the device), and used for all new connections. To avoid disrupting +old connections the old device key remains in the NIC. A phase bit +carried in the packet headers indicates which generation of device key +the packet has been encrypted with. + +User facing API +=============== + +PSP is designed primarily for hardware offloads. There is currently +no software fallback for systems which do not have PSP capable NICs. +There is also no standard (or otherwise defined) way of establishing +a PSP-secured connection or exchanging the symmetric keys. + +The expectation is that higher layer protocols will take care of +protocol and key negotiation. For example one may use TLS key exchange, +announce the PSP capability, and switch to PSP if both endpoints +are PSP-capable. + +All configuration of PSP is performed via the PSP netlink family. + +Device discovery +---------------- + +The PSP netlink family defines operations to retrieve information +about the PSP devices available on the system, configure them and +access PSP related statistics. + +Securing a connection +--------------------- + +PSP encryption is currently only supported for TCP connections. +Rx and Tx keys are allocated separately. First the ``rx-assoc`` +Netlink command needs to be issued, specifying a target TCP socket. +Kernel will allocate a new PSP Rx key from the NIC and associate it +with given socket. At this stage socket will accept both PSP-secured +and plain text TCP packets. + +Tx keys are installed using the ``tx-assoc`` Netlink command. +Once the Tx keys are installed, all data read from the socket will +be PSP-secured. In other words act of installing Tx keys has a secondary +effect on the Rx direction. + +There is an intermediate period after ``tx-assoc`` successfully +returns and before the TCP socket encounters it's first PSP +authenticated packet, where the TCP stack will allow certain nondata +packets, i.e. ACKs, FINs, and RSTs, to enter TCP receive processing +even if not PSP authenticated. During the ``tx-assoc`` call, the TCP +socket's ``rcv_nxt`` field is recorded. At this point, ACKs and RSTs +will be accepted with any sequence number, while FINs will only be +accepted at the latched value of ``rcv_nxt``. Once the TCP stack +encounters the first TCP packet containing PSP authenticated data, the +other end of the connection must have executed the ``tx-assoc`` +command, so any TCP packet, including those without data, will be +dropped before receive processing if it is not successfully +authenticated. This is summarized in the table below. The +aforementioned state of rejecting all non-PSP packets is labeled "PSP +Full". + ++----------------+------------+------------+-------------+-------------+ +| Event | Normal TCP | Rx PSP | Tx PSP | PSP Full | ++================+============+============+=============+=============+ +| Rx plain | accept | accept | drop | drop | +| (data) | | | | | ++----------------+------------+------------+-------------+-------------+ +| Rx plain | accept | accept | accept | drop | +| (ACK|FIN|RST) | | | | | ++----------------+------------+------------+-------------+-------------+ +| Rx PSP (good) | drop | accept | accept | accept | ++----------------+------------+------------+-------------+-------------+ +| Rx PSP (bad | drop | drop | drop | drop | +| crypt, !=SPI) | | | | | ++----------------+------------+------------+-------------+-------------+ +| Tx | plain text | plain text | encrypted | encrypted | +| | | | (excl. rtx) | (excl. rtx) | ++----------------+------------+------------+-------------+-------------+ + +To ensure that any data read from the socket after the ``tx-assoc`` +call returns success has been authenticated, the kernel will scan the +receive and ofo queues of the socket at ``tx-assoc`` time. If any +enqueued packet was received in clear text, the Tx association will +fail, and the application should retry installing the Tx key after +draining the socket (this should not be necessary if both endpoints +are well behaved). + +Because TCP sequence numbers are not integrity protected prior to +upgrading to PSP, it is possible that a MITM could offset sequence +numbers in a way that deletes a prefix of the PSP protected part of +the TCP stream. If userspace cares to mitigate this type of attack, a +special "start of PSP" message should be exchanged after ``tx-assoc``. + +Rotation notifications +---------------------- + +The rotations of device key happen asynchronously and are usually +performed by management daemons, not under application control. +The PSP netlink family will generate a notification whenever keys +are rotated. The applications are expected to re-establish connections +before keys are rotated again. + +Kernel implementation +===================== + +Driver notes +------------ + +Drivers are expected to start with no PSP enabled (``psp-versions-ena`` +in ``dev-get`` set to ``0``) whenever possible. The user space should +not depend on this behavior, as future extension may necessitate creation +of devices with PSP already enabled, nonetheless drivers should not enable +PSP by default. Enabling PSP should be the responsibility of the system +component which also takes care of key rotation. + +Note that ``psp-versions-ena`` is expected to be used only for enabling +receive processing. The device is not expected to reject transmit requests +after ``psp-versions-ena`` has been disabled. User may also disable +``psp-versions-ena`` while there are active associations, which will +break all PSP Rx processing. + +Drivers are expected to ensure that a device key is usable and secure +upon init, without explicit key rotation by the user space. It must be +possible to allocate working keys, and that no duplicate keys must be +generated. If the device allows the host to request the key for an +arbitrary SPI - driver should discard both device keys (rotate the +device key twice), to avoid potentially using a SPI+key which previous +OS instance already had access to. + +Drivers must use ``psp_skb_get_assoc_rcu()`` to check if PSP Tx offload +was requested for given skb. On Rx drivers should allocate and populate +the ``SKB_EXT_PSP`` skb extension, and set the skb->decrypted bit to 1. + +Kernel implementation notes +--------------------------- + +PSP implementation follows the TLS offload more closely than the IPsec +offload, with per-socket state, and the use of skb->decrypted to prevent +clear text leaks. + +PSP device is separate from netdev, to make it possible to "delegate" +PSP offload capabilities to software devices (e.g. ``veth``). diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst index 41b0a6182fe4..4261146e9d92 100644 --- a/Documentation/networking/rds.rst +++ b/Documentation/networking/rds.rst @@ -339,7 +339,7 @@ The send path rds_sendmsg() - struct rds_message built from incoming data - CMSGs parsed (e.g. RDMA ops) - - transport connection alloced and connected if not already + - transport connection allocated and connected if not already - rds_message placed on send queue - send worker awoken diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst index d63e3e27dd06..8926dab8e2e6 100644 --- a/Documentation/networking/rxrpc.rst +++ b/Documentation/networking/rxrpc.rst @@ -437,8 +437,7 @@ message type supported. At run time this can be queried by means of the RXRPC_SUPPORTED_CMSG socket option (see below). -============== -SOCKET OPTIONS +Socket Options ============== AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: @@ -495,8 +494,7 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: the highest control message type supported. -======== -SECURITY +Security ======== Currently, only the kerberos 4 equivalent protocol has been implemented @@ -540,8 +538,7 @@ be found at: http://people.redhat.com/~dhowells/rxrpc/listen.c -==================== -EXAMPLE CLIENT USAGE +Example Client Usage ==================== A client would issue an operation by: diff --git a/Documentation/networking/segmentation-offloads.rst b/Documentation/networking/segmentation-offloads.rst index 085e8fab03fd..72f69b22b28c 100644 --- a/Documentation/networking/segmentation-offloads.rst +++ b/Documentation/networking/segmentation-offloads.rst @@ -43,10 +43,19 @@ also point to the TCP header of the packet. For IPv4 segmentation we support one of two types in terms of the IP ID. The default behavior is to increment the IP ID with every segment. If the GSO type SKB_GSO_TCP_FIXEDID is specified then we will not increment the IP -ID and all segments will use the same IP ID. If a device has -NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO -and we will either increment the IP ID for all frames, or leave it at a -static value based on driver preference. +ID and all segments will use the same IP ID. + +For encapsulated packets, SKB_GSO_TCP_FIXEDID refers only to the outer header. +SKB_GSO_TCP_FIXEDID_INNER can be used to specify the same for the inner header. +Any combination of these two GSO types is allowed. + +If a device has NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when +performing TSO and we will either increment the IP ID for all frames, or leave +it at a static value based on driver preference. For encapsulated packets, +NETIF_F_TSO_MANGLEID is relevant for both outer and inner headers, unless the +DF bit is not set on the outer header, in which case the device driver must +guarantee that the IP ID field is incremented in the outer header with every +segment. UDP Fragmentation Offload @@ -124,10 +133,7 @@ Generic Receive Offload Generic receive offload is the complement to GSO. Ideally any frame assembled by GRO should be segmented to create an identical sequence of frames using GSO, and any sequence of frames segmented by GSO should be -able to be reassembled back to the original by GRO. The only exception to -this is IPv4 ID in the case that the DF bit is set for a given IP header. -If the value of the IPv4 ID is not sequentially incrementing it will be -altered so that it is when a frame assembled via GRO is segmented via GSO. +able to be reassembled back to the original by GRO. Partial Generic Segmentation Offload |