Diffstat (limited to 'Documentation/networking')
27 files changed, 1161 insertions, 239 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index dceeb0d763aa..50d92084a49c 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -209,13 +209,10 @@ Libbpf Libbpf is a helper library for eBPF and XDP that makes using these technologies a lot simpler. It also contains specific helper functions -in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It -contains two types of functions: those that can be used to make the -setup of AF_XDP socket easier and ones that can be used in the data -plane to access the rings safely and quickly. To see an example on how -to use this API, please take a look at the sample application in -samples/bpf/xdpsock_usr.c which uses libbpf for both setup and data -plane operations. +in tools/testing/selftests/bpf/xsk.h for facilitating the use of +AF_XDP. It contains two types of functions: those that can be used to +make the setup of AF_XDP socket easier and ones that can be used in the +data plane to access the rings safely and quickly. We recommend that you use this library unless you have become a power user. It will make your program a lot simpler. @@ -372,9 +369,8 @@ needs to explicitly notify the kernel to send any packets put on the TX ring. This can be accomplished either by a poll() call, as in the RX path, or by calling sendto(). -An example of how to use this flag can be found in -samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers -would look like this for the TX path: +An example with the use of libbpf helpers would look like this for the +TX path: .. code-block:: c @@ -442,6 +438,15 @@ is created by a privileged process and passed to a non-privileged one. Once the option is set, kernel will refuse attempts to bind that socket to a different interface. Updating the value requires CAP_NET_RAW. 
+XDP_MAX_TX_SKB_BUDGET setsockopt +-------------------------------- + +This setsockopt sets the maximum number of descriptors that can be handled +and passed to the driver in one send syscall. It applies in copy +mode and lets an application tune the per-socket maximum iteration count for +better throughput and fewer send syscalls. +The allowed range is [32, xs->tx->nentries]. + XDP_STATISTICS getsockopt ------------------------- @@ -549,12 +554,12 @@ later in this document. Usage ----- -In order to use AF_XDP sockets two parts are needed. The -user-space application and the XDP program. For a complete setup and -usage example, please refer to the sample application. The user-space -side is xdpsock_user.c and the XDP side is part of libbpf. +In order to use AF_XDP sockets two parts are needed. The user-space +application and the XDP program. For a complete setup and usage example, +please refer to the xdp-project at +https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example. -The XDP code sample included in tools/lib/bpf/xsk.c is the following: +The XDP code sample is the following: .. code-block:: c @@ -752,11 +757,12 @@ to facilitate extending a zero-copy driver with multi-buffer support. Sample application ================== - -There is a xdpsock benchmarking/test application included that -demonstrates how to use AF_XDP sockets with private UMEMs. Say that -you would like your UDP traffic from port 4242 to end up in queue 16, -that we will enable AF_XDP on. Here, we use ethtool for this:: +There is a xdpsock benchmarking/test application that can be found at +https://github.com/xdp-project/bpf-examples/tree/main/AF_XDP-example +that demonstrates how to use AF_XDP sockets with private +UMEMs. Say that you would like your UDP traffic from port 4242 to end +up in queue 16, that we will enable AF_XDP on. 
Here, we use ethtool +for this:: ethtool -N p3p2 rx-flow-hash udp4 fn ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ @@ -773,7 +779,7 @@ can be displayed with "-h", as usual. This sample application uses libbpf to make the setup and usage of AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is really used to make something more advanced, take a look at the libbpf -code in tools/lib/bpf/xsk.[ch]. +code in tools/testing/selftests/bpf/xsk.[ch]. FAQ ======= diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index a4c1291d2561..f8f5766703d4 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -562,6 +562,12 @@ lacp_rate The default is slow. +broadcast_neighbor + + Option specifying whether to broadcast ARP/ND packets to all + active slaves. This option has no effect in modes other than + 802.3ad mode. The default is off (0). + max_bonds Specifies the number of bonding devices to create for this @@ -767,8 +773,9 @@ num_unsol_na greater than 1. The valid range is 0 - 255; the default value is 1. These options - affect only the active-backup mode. These options were added for - bonding versions 3.3.0 and 3.4.0 respectively. + affect the active-backup or 802.3ad (broadcast_neighbor enabled) mode. + These options were added for bonding versions 3.3.0 and 3.4.0 + respectively. From Linux 3.0 and bonding version 3.7.1, these notifications are generated by the ipv4 and ipv6 code and the numbers of diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst index b018ce346392..bc1b585355f7 100644 --- a/Documentation/networking/can.rst +++ b/Documentation/networking/can.rst @@ -1104,15 +1104,12 @@ for writing CAN network device driver are described below: General Settings ---------------- -.. 
code-block:: C - - dev->type = ARPHRD_CAN; /* the netdevice hardware type */ - dev->flags = IFF_NOARP; /* CAN has no arp */ +CAN network device drivers can use alloc_candev_mqs() and friends instead of +alloc_netdev_mqs(), to automatically take care of CAN-specific setup: - dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> Classical CAN interface */ +.. code-block:: C - or alternative, when the controller supports CAN with flexible data rate: - dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */ + dev = alloc_candev_mqs(...); The struct can_frame or struct canfd_frame is the payload of each socket buffer (skbuff) in the protocol family PF_CAN. diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst index 4561e8ab9e08..14784a0a6a8a 100644 --- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst +++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst @@ -56,6 +56,9 @@ ena_netdev.[ch] Main Linux kernel driver. ena_ethtool.c ethtool callbacks. ena_xdp.[ch] XDP files ena_pci_id_tbl.h Supported device IDs. +ena_phc.[ch] PTP hardware clock infrastructure (see `PHC`_ for more info) +ena_devlink.[ch] devlink files. +ena_debugfs.[ch] debugfs files. ================= ====================================================== Management Interface: @@ -221,6 +224,99 @@ descriptor it was received on would be recycled. When a packet smaller than RX copybreak bytes is received, it is copied into a new memory buffer and the RX descriptor is returned to HW. +.. _`PHC`: + +PTP Hardware Clock (PHC) +======================== +.. _`ptp-userspace-api`: https://docs.kernel.org/driver-api/ptp.html#ptp-hardware-clock-user-space-api +.. _`testptp`: https://elixir.bootlin.com/linux/latest/source/tools/testing/selftests/ptp/testptp.c + +ENA Linux driver supports PTP hardware clock providing timestamp reference to achieve nanosecond resolution. 
+ +**PHC support** + +PHC depends on the PTP module, which needs to be either loaded as a module or compiled into the kernel. + +Verify if the PTP module is present: + +.. code-block:: shell + + grep -w '^CONFIG_PTP_1588_CLOCK=[ym]' /boot/config-`uname -r` + +- If no output is provided, the ENA driver cannot be loaded with PHC support. + +**PHC activation** + +The feature is turned off by default, in order to turn the feature on, the ENA driver +can be loaded in the following way: + +- devlink: + +.. code-block:: shell + + sudo devlink dev param set pci/<domain:bus:slot.function> name enable_phc value true cmode driverinit + sudo devlink dev reload pci/<domain:bus:slot.function> + # for example: + sudo devlink dev param set pci/0000:00:06.0 name enable_phc value true cmode driverinit + sudo devlink dev reload pci/0000:00:06.0 + +All available PTP clock sources can be tracked here: + +.. code-block:: shell + + ls /sys/class/ptp + +PHC support and capabilities can be verified using ethtool: + +.. code-block:: shell + + ethtool -T <interface> + +**PHC timestamp** + +To retrieve PHC timestamp, use `ptp-userspace-api`_, usage example using `testptp`_: + +.. code-block:: shell + + testptp -d /dev/ptp$(ethtool -T <interface> | awk '/PTP Hardware Clock:/ {print $NF}') -k 1 + +PHC get time requests should be within reasonable bounds, +avoid excessive utilization to ensure optimal performance and efficiency. +The ENA device restricts the frequency of PHC get time requests to a maximum +of 125 requests per second. If this limit is surpassed, the get time request +will fail, leading to an increment in the phc_err_ts statistic. + +**PHC statistics** + +PHC can be monitored using debugfs (if mounted): + +.. 
code-block:: shell + + sudo cat /sys/kernel/debug/<domain:bus:slot.function>/phc_stats + + # for example: + sudo cat /sys/kernel/debug/0000:00:06.0/phc_stats + +PHC errors must remain below 1% of all PHC requests to maintain the desired level of accuracy and reliability + +================= ====================================================== +**phc_cnt** | Number of successful retrieved timestamps (below expire timeout). +**phc_exp** | Number of expired retrieved timestamps (above expire timeout). +**phc_skp** | Number of skipped get time attempts (during block period). +**phc_err_dv** | Number of failed get time attempts due to device errors (entering into block state). +**phc_err_ts** | Number of failed get time attempts due to timestamp errors (entering into block state), + | This occurs if driver exceeded the request limit or device received an invalid timestamp. +================= ====================================================== + +PHC timeouts: + +================= ====================================================== +**expire** | Max time for a valid timestamp retrieval, passing this threshold will fail + | the get time request and block new requests until block timeout. +**block** | Blocking period starts once get time request expires or fails, + | all get time requests during block period will be skipped. +================= ====================================================== + Statistics ========== @@ -268,6 +364,18 @@ RSS - The user can provide a hash key, hash function, and configure the indirection table through `ethtool(8)`. +DEVLINK SUPPORT +=============== +.. _`devlink`: https://www.kernel.org/doc/html/latest/networking/devlink/index.html + +`devlink`_ supports reloading the driver and initiating re-negotiation with the ENA device + +.. 
code-block:: shell + + sudo devlink dev reload pci/<domain:bus:slot.function> + # for example: + sudo devlink dev reload pci/0000:00:06.0 + DATA PATH ========= diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst index 139b4c75a191..40ac552641a3 100644 --- a/Documentation/networking/device_drivers/ethernet/index.rst +++ b/Documentation/networking/device_drivers/ethernet/index.rst @@ -58,7 +58,9 @@ Contents: ti/tlan ti/icssg_prueth wangxun/txgbe + wangxun/txgbevf wangxun/ngbe + wangxun/ngbevf .. only:: subproject and html diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst index 3c46a48d99ba..0bca293cf9cb 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst @@ -927,6 +927,19 @@ To enable/disable UDP Segmentation Offload, issue the following command:: # ethtool -K <ethX> tx-udp-segmentation [off|on] +PTP pin interface +----------------- +All adapters support standard PTP pin interface. SDPs (Software Definable Pin) +are single ended pins with both periodic output and external timestamp +supported. There are also specific differential input/output pins (TIME_SYNC, +1PPS) with only one of the functions supported. + +There are adapters with DPLL, where pins are connected to the DPLL instead of +being exposed on the board. You have to be aware that in those configurations, +only SDP pins are exposed and each pin has its own fixed direction. +To see input signal on those PTP pins, you need to configure DPLL properly. +Output signal is only visible on DPLL and to send it to the board SMA/U.FL pins, +DPLL output pins have to be manually configured. 
 GNSS module ----------- diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst index 43d72c8b713b..754c81436408 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -1341,3 +1341,35 @@ Device Counters - The number of times the device owned queue had not enough buffers allocated. - Error + + * - `pci_bw_inbound_high` + - The number of times the device crossed the high inbound PCIe bandwidth + threshold. To be compared to pci_bw_inbound_low to check if the device + is in a congested state. + If pci_bw_inbound_high == pci_bw_inbound_low then the device is not congested. + If pci_bw_inbound_high > pci_bw_inbound_low then the device is congested. + - Informative + + * - `pci_bw_inbound_low` + - The number of times the device crossed the low inbound PCIe bandwidth + threshold. To be compared to pci_bw_inbound_high to check if the device + is in a congested state. + If pci_bw_inbound_high == pci_bw_inbound_low then the device is not congested. + If pci_bw_inbound_high > pci_bw_inbound_low then the device is congested. + - Informative + + * - `pci_bw_outbound_high` + - The number of times the device crossed the high outbound PCIe bandwidth + threshold. To be compared to pci_bw_outbound_low to check if the device + is in a congested state. + If pci_bw_outbound_high == pci_bw_outbound_low then the device is not congested. + If pci_bw_outbound_high > pci_bw_outbound_low then the device is congested. + - Informative + + * - `pci_bw_outbound_low` + - The number of times the device crossed the low outbound PCIe bandwidth + threshold. To be compared to pci_bw_outbound_high to check if the device + is in a congested state. + If pci_bw_outbound_high == pci_bw_outbound_low then the device is not congested. 
+ If pci_bw_outbound_high > pci_bw_outbound_low then the device is congested. + - Informative diff --git a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst index f8592dec8851..afb8353daefd 100644 --- a/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst +++ b/Documentation/networking/device_drivers/ethernet/meta/fbnic.rst @@ -28,6 +28,36 @@ devlink dev info provides version information for all three components. In addition to the version the hg commit hash of the build is included as a separate entry. +Configuration +------------- + +Ringparams (ethtool -g / -G) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +fbnic has two submission (host -> device) rings for every completion +(device -> host) ring. The three ring objects together form a single +"queue" as used by higher layer software (a Rx, or a Tx queue). + +For Rx the two submission rings are used to pass empty pages to the NIC. +Ring 0 is the Header Page Queue (HPQ), NIC will use its pages to place +L2-L4 headers (or full frames if frame is not header-data split). +Ring 1 is the Payload Page Queue (PPQ) and used for packet payloads. +The completion ring is used to receive packet notifications / metadata. +ethtool ``rx`` ringparam maps to the size of the completion ring, +``rx-mini`` to the HPQ, and ``rx-jumbo`` to the PPQ. + +For Tx both submission rings can be used to submit packets, the completion +ring carries notifications for both. fbnic uses one of the submission +rings for normal traffic from the stack and the second one for XDP frames. +ethtool ``tx`` ringparam controls both the size of the submission rings +and the completion ring. + +Every single entry on the HPQ and PPQ (``rx-mini``, ``rx-jumbo``) +corresponds to 4kB of allocated memory, while entries on the remaining +rings are in units of descriptors (8B). 
The ideal ratio of submission +and completion ring sizes will depend on the workload, as for small packets +multiple packets will fit into a single page. + Upgrading Firmware ------------------ diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/ngbevf.rst b/Documentation/networking/device_drivers/ethernet/wangxun/ngbevf.rst new file mode 100644 index 000000000000..a39e3d5a1038 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/wangxun/ngbevf.rst @@ -0,0 +1,16 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +================================================================== +Linux Base Virtual Function Driver for Wangxun(R) Gigabit Ethernet +================================================================== + +WangXun Gigabit Virtual Function Linux driver. +Copyright(c) 2015 - 2025 Beijing WangXun Technology Co., Ltd. + +Support +======= +For general information, go to the website at: +https://www.net-swift.com + +If you have any problems, contact the Wangxun support team via nic-support@net-swift.com +and Cc: netdev. diff --git a/Documentation/networking/device_drivers/ethernet/wangxun/txgbevf.rst b/Documentation/networking/device_drivers/ethernet/wangxun/txgbevf.rst new file mode 100644 index 000000000000..b2f759b7b518 --- /dev/null +++ b/Documentation/networking/device_drivers/ethernet/wangxun/txgbevf.rst @@ -0,0 +1,16 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +=========================================================================== +Linux Base Virtual Function Driver for Wangxun(R) 10/25/40 Gigabit Ethernet +=========================================================================== + +WangXun 10/25/40 Gigabit Virtual Function Linux driver. +Copyright(c) 2015 - 2025 Beijing WangXun Technology Co., Ltd. + +Support +======= +For general information, go to the website at: +https://www.net-swift.com + +If you have any problems, contact the Wangxun support team via nic-support@net-swift.com +and Cc: netdev. 
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst index 4e01dc32bc08..211b58177e12 100644 --- a/Documentation/networking/devlink/devlink-params.rst +++ b/Documentation/networking/devlink/devlink-params.rst @@ -137,3 +137,9 @@ own name. * - ``event_eq_size`` - u32 - Control the size of asynchronous control events EQ. + * - ``enable_phc`` + - Boolean + - Enable PHC (PTP Hardware Clock) functionality in the device. + * - ``clock_id`` + - u64 + - Clock ID used by the device for registering DPLL devices and pins. diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst index 9d22d41a7cd1..5e397798a402 100644 --- a/Documentation/networking/devlink/devlink-port.rst +++ b/Documentation/networking/devlink/devlink-port.rst @@ -418,6 +418,14 @@ API allows to configure following rate object's parameters: to all node children limits. ``tx_max`` is an upper limit for children. ``tx_share`` is a total bandwidth distributed among children. +``tc_bw`` + Allow users to set the bandwidth allocation per traffic class on rate + objects. This enables fine-grained QoS configurations by assigning a relative + share value to each traffic class. The bandwidth is distributed in proportion + to the share value for each class, relative to the sum of all shares. + When applied to a non-leaf node, tc_bw determines how bandwidth is shared + among its child elements. + ``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case nodes with the same priority form a WFQ subgroup in the sibling group and arbitration among them is based on assigned weights. 
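The proportional rule that the devlink-port.rst hunk above gives for ``tc_bw`` (each traffic class receives bandwidth in proportion to its share, relative to the sum of all shares) can be illustrated with plain arithmetic. This is only a sketch of that rule, not the kernel's or any driver's implementation; the function name ``split_bandwidth``, the 10000 Mbit/s total and the share values are hypothetical.

```python
def split_bandwidth(total_mbps, shares):
    """Split total_mbps across traffic classes in proportion to their shares.

    Illustrative arithmetic only: each class gets
    total_mbps * share / sum(shares).
    """
    total_share = sum(shares)
    if total_share <= 0:
        raise ValueError("at least one traffic class needs a non-zero share")
    return [total_mbps * share / total_share for share in shares]


if __name__ == "__main__":
    # Hypothetical rate node with 10000 Mbit/s and eight traffic classes.
    shares = [20, 0, 0, 0, 30, 0, 0, 50]
    for tc, mbps in enumerate(split_bandwidth(10_000, shares)):
        print(f"tc{tc}: {mbps:.0f} Mbit/s")
```

With these assumed shares, tc0 would receive 2000 Mbit/s, tc4 3000 Mbit/s and tc7 5000 Mbit/s. On a non-leaf node the same proportions would describe how bandwidth is divided among the node's children.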
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index 8319f43b5933..270a65a01411 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -85,6 +85,8 @@ parameters, info versions, and other features it supports. ionic ice ixgbe + kvaser_pciefd + kvaser_usb mlx4 mlx5 mlxsw @@ -98,3 +100,4 @@ parameters, info versions, and other features it supports. iosm octeontx2 sfc + zl3073x diff --git a/Documentation/networking/devlink/kvaser_pciefd.rst b/Documentation/networking/devlink/kvaser_pciefd.rst new file mode 100644 index 000000000000..075edd2a508a --- /dev/null +++ b/Documentation/networking/devlink/kvaser_pciefd.rst @@ -0,0 +1,24 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= +kvaser_pciefd devlink support +============================= + +This document describes the devlink features implemented by the +``kvaser_pciefd`` device driver. + +Info versions +============= + +The ``kvaser_pciefd`` driver reports the following versions + +.. list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``fw`` + - running + - Version of the firmware running on the device. Also available + through ``ethtool -i`` as ``firmware-version``. diff --git a/Documentation/networking/devlink/kvaser_usb.rst b/Documentation/networking/devlink/kvaser_usb.rst new file mode 100644 index 000000000000..403db3766cb4 --- /dev/null +++ b/Documentation/networking/devlink/kvaser_usb.rst @@ -0,0 +1,33 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================== +kvaser_usb devlink support +========================== + +This document describes the devlink features implemented by the +``kvaser_usb`` device driver. + +Info versions +============= + +The ``kvaser_usb`` driver reports the following versions + +.. 
list-table:: devlink info versions implemented + :widths: 5 5 90 + + * - Name + - Type + - Description + * - ``fw`` + - running + - Version of the firmware running on the device. Also available + through ``ethtool -i`` as ``firmware-version``. + * - ``board.rev`` + - fixed + - The device hardware revision. + * - ``board.id`` + - fixed + - The device EAN (product number). + * - ``serial_number`` + - fixed + - The device serial number. diff --git a/Documentation/networking/devlink/netdevsim.rst b/Documentation/networking/devlink/netdevsim.rst index 88482725422c..3932004eae82 100644 --- a/Documentation/networking/devlink/netdevsim.rst +++ b/Documentation/networking/devlink/netdevsim.rst @@ -62,7 +62,7 @@ Rate objects The ``netdevsim`` driver supports rate objects management, which includes: -- registerging/unregistering leaf rate objects per VF devlink port; +- registering/unregistering leaf rate objects per VF devlink port; - creation/deletion node rate objects; - setting tx_share and tx_max rate values for any rate object type; - setting parent node for any rate object type. diff --git a/Documentation/networking/devlink/zl3073x.rst b/Documentation/networking/devlink/zl3073x.rst new file mode 100644 index 000000000000..4b6cfaf38643 --- /dev/null +++ b/Documentation/networking/devlink/zl3073x.rst @@ -0,0 +1,51 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +zl3073x devlink support +======================= + +This document describes the devlink features implemented by the ``zl3073x`` +device driver. + +Parameters +========== + +.. list-table:: Generic parameters implemented + :widths: 5 5 90 + + * - Name + - Mode + - Notes + * - ``clock_id`` + - driverinit + - Set the clock ID that is used by the driver for registering DPLL devices + and pins. + +Info versions +============= + +The ``zl3073x`` driver reports the following versions + +.. 
list-table:: devlink info versions implemented + :widths: 5 5 5 90 + + * - Name + - Type + - Example + - Description + * - ``asic.id`` + - fixed + - 1E94 + - Chip identification number + * - ``asic.rev`` + - fixed + - 300 + - Chip revision number + * - ``fw`` + - running + - 7006 + - Firmware version number + * - ``custom_cfg`` + - running + - 1.3.0.1 + - Device configuration version customized by OEM diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index b6e9af4d0f1b..ab20c644af24 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -239,6 +239,9 @@ Userspace to kernel: ``ETHTOOL_MSG_PHY_GET`` get Ethernet PHY information ``ETHTOOL_MSG_TSCONFIG_GET`` get hw timestamping configuration ``ETHTOOL_MSG_TSCONFIG_SET`` set hw timestamping configuration + ``ETHTOOL_MSG_RSS_SET`` set RSS settings + ``ETHTOOL_MSG_RSS_CREATE_ACT`` create an additional RSS context + ``ETHTOOL_MSG_RSS_DELETE_ACT`` delete an additional RSS context ===================================== ================================= Kernel to userspace: @@ -281,6 +284,7 @@ Kernel to userspace: ``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters ``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters ``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings + ``ETHTOOL_MSG_RSS_NTF`` RSS settings ``ETHTOOL_MSG_PLCA_GET_CFG_REPLY`` PLCA RS parameters ``ETHTOOL_MSG_PLCA_GET_STATUS_REPLY`` PLCA RS status ``ETHTOOL_MSG_PLCA_NTF`` PLCA RS parameters @@ -290,6 +294,11 @@ Kernel to userspace: ``ETHTOOL_MSG_PHY_NTF`` Ethernet PHY information change ``ETHTOOL_MSG_TSCONFIG_GET_REPLY`` hw timestamping configuration ``ETHTOOL_MSG_TSCONFIG_SET_REPLY`` new hw timestamping configuration + ``ETHTOOL_MSG_PSE_NTF`` PSE events notification + ``ETHTOOL_MSG_RSS_NTF`` RSS settings notification + ``ETHTOOL_MSG_RSS_CREATE_ACT_REPLY`` create an additional RSS context + ``ETHTOOL_MSG_RSS_CREATE_NTF`` additional RSS context created + 
 ``ETHTOOL_MSG_RSS_DELETE_NTF`` additional RSS context deleted ======================================== ================================= ``GET`` requests are sent by userspace applications to retrieve device @@ -1788,6 +1797,11 @@ Kernel response contents: limit of the PoE PSE. ``ETHTOOL_A_C33_PSE_PW_LIMIT_RANGES`` nested Supported power limit configuration ranges. + ``ETHTOOL_A_PSE_PW_D_ID`` u32 Index of the PSE power domain + ``ETHTOOL_A_PSE_PRIO_MAX`` u32 Priority maximum configurable + on the PoE PSE + ``ETHTOOL_A_PSE_PRIO`` u32 Priority of the PoE PSE + currently configured ========================================== ====== ============================= When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` attribute identifies @@ -1861,6 +1875,15 @@ identifies the C33 PSE power limit ranges through If the controller works with fixed classes, the min and max values will be equal. +The ``ETHTOOL_A_PSE_PW_D_ID`` attribute identifies the index of the PSE power +domain. + +When set, the optional ``ETHTOOL_A_PSE_PRIO_MAX`` attribute identifies +the PSE maximum priority value. +When set, the optional ``ETHTOOL_A_PSE_PRIO`` attribute +identifies the currently configured PSE priority. +For a description of PSE priority attributes, see ``PSE_SET``. + PSE_SET ======= @@ -1874,6 +1897,8 @@ Request contents: ``ETHTOOL_A_C33_PSE_ADMIN_CONTROL`` u32 Control PSE Admin state ``ETHTOOL_A_C33_PSE_AVAIL_PWR_LIMIT`` u32 Control PoE PSE available power limit + ``ETHTOOL_A_PSE_PRIO`` u32 Control priority of the + PoE PSE ====================================== ====== ============================= When set, the optional ``ETHTOOL_A_PODL_PSE_ADMIN_CONTROL`` attribute is used @@ -1896,6 +1921,38 @@ various existing products that document power consumption in watts rather than classes. If power limit configuration based on classes is needed, the conversion can be done in user space, for example by ethtool. 
+When set, the optional ``ETHTOOL_A_PSE_PRIO`` attribute is used to +control the PSE priority. Allowed priority values are between zero and +the value of the ``ETHTOOL_A_PSE_PRIO_MAX`` attribute. + +A lower value indicates a higher priority, meaning that a priority value +of 0 corresponds to the highest port priority. +Port priority serves two functions: + + - Power-up Order: After a reset, ports are powered up in order of their + priority from highest to lowest. Ports with higher priority + (lower values) power up first. + - Shutdown Order: When the power budget is exceeded, ports with lower + priority (higher values) are turned off first. + +PSE_NTF +======= + +Notify PSE events. + +Notification contents: + + =============================== ====== ======================== + ``ETHTOOL_A_PSE_HEADER`` nested request header + ``ETHTOOL_A_PSE_EVENTS`` bitset PSE events + =============================== ====== ======================== + +When set, the optional ``ETHTOOL_A_PSE_EVENTS`` attribute identifies the +PSE events. + +.. kernel-doc:: include/uapi/linux/ethtool_netlink_generated.h + :identifiers: ethtool_pse_event + RSS_GET ======= @@ -1919,14 +1976,15 @@ used to ignore context 0s and only dump additional contexts). Kernel response contents: -===================================== ====== ========================== +===================================== ====== =============================== ``ETHTOOL_A_RSS_HEADER`` nested reply header ``ETHTOOL_A_RSS_CONTEXT`` u32 context number ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes ``ETHTOOL_A_RSS_INPUT_XFRM`` u32 RSS input data transformation -===================================== ====== ========================== + ``ETHTOOL_A_RSS_FLOW_HASH`` nested Header fields included in hash +===================================== ====== =============================== ETHTOOL_A_RSS_HFUNC attribute is bitmap indicating the hash function being used. 
 Current supported options are toeplitz, xor or crc32. @@ -1935,6 +1993,67 @@ indicates queue number. ETHTOOL_A_RSS_INPUT_XFRM attribute is a bitmap indicating the type of transformation applied to the input protocol fields before given to the RSS hfunc. Current supported options are symmetric-xor and symmetric-or-xor. +ETHTOOL_A_RSS_FLOW_HASH carries a per-flow-type bitmask of which header +fields are included in the hash calculation. + +RSS_SET +======= + +Request contents: + +===================================== ====== ============================== + ``ETHTOOL_A_RSS_HEADER`` nested request header + ``ETHTOOL_A_RSS_CONTEXT`` u32 context number + ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func + ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes + ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes + ``ETHTOOL_A_RSS_INPUT_XFRM`` u32 RSS input data transformation + ``ETHTOOL_A_RSS_FLOW_HASH`` nested Header fields included in hash +===================================== ====== ============================== + +``ETHTOOL_A_RSS_INDIR`` is the minimal RSS table the user expects. Kernel and +the device driver may replicate the table if it is smaller than the smallest table +size supported by the device. For example, if the user requests ``[0, 1]`` but the +device needs at least 8 entries - the real table in use will end up being +``[0, 1, 0, 1, 0, 1, 0, 1]``. Most devices require the table size to be a power +of 2, so tables whose size is not a power of 2 will likely be rejected. +Using a table of size 0 will reset the indirection table to the default. 
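The table replication described for ``ETHTOOL_A_RSS_INDIR`` above can be sketched as follows. This is a minimal illustration of the documented behaviour only, assuming the table is simply repeated in whole copies until the device minimum is reached; the actual kernel and driver logic may differ. The names ``replicate_indir`` and ``is_power_of_two`` are hypothetical helpers, not part of any ethtool API.

```python
def is_power_of_two(n):
    """Return True if n is a positive power of 2 (1, 2, 4, 8, ...)."""
    return n > 0 and (n & (n - 1)) == 0


def replicate_indir(table, min_size):
    """Repeat a user-supplied indirection table until it has min_size entries.

    Assumption for illustration: whole copies of the table are appended,
    so a table that already meets the minimum is returned unchanged.
    """
    if not table:
        # A size-0 table means "reset to default", so there is nothing to repeat.
        raise ValueError("empty table requests a reset, not a replication")
    copies = -(-min_size // len(table))  # ceiling division
    return table * max(copies, 1)


if __name__ == "__main__":
    requested = [0, 1]
    # Hypothetical device that needs at least 8 indirection entries.
    print(replicate_indir(requested, 8))  # [0, 1, 0, 1, 0, 1, 0, 1]
    # A 6-entry table is not a power of 2 and would likely be rejected.
    print(is_power_of_two(6))  # False
```

Checking ``is_power_of_two(len(table))`` before sending a request is one way a userspace tool might avoid submitting a table that most devices would reject.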
+ +RSS_CREATE_ACT +============== + +Request contents: + +===================================== ====== ============================== + ``ETHTOOL_A_RSS_HEADER`` nested request header + ``ETHTOOL_A_RSS_CONTEXT`` u32 context number + ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func + ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes + ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes + ``ETHTOOL_A_RSS_INPUT_XFRM`` u32 RSS input data transformation +===================================== ====== ============================== + +Kernel response contents: + +===================================== ====== ============================== + ``ETHTOOL_A_RSS_HEADER`` nested request header + ``ETHTOOL_A_RSS_CONTEXT`` u32 context number +===================================== ====== ============================== + +Create an additional RSS context, if ``ETHTOOL_A_RSS_CONTEXT`` is not +specified kernel will allocate one automatically. + +RSS_DELETE_ACT +============== + +Request contents: + +===================================== ====== ============================== + ``ETHTOOL_A_RSS_HEADER`` nested request header + ``ETHTOOL_A_RSS_CONTEXT`` u32 context number +===================================== ====== ============================== + +Delete an additional RSS context. PLCA_GET_CFG ============ @@ -2386,8 +2505,8 @@ are netlink only. ``ETHTOOL_SFLAGS`` ``ETHTOOL_MSG_FEATURES_SET`` ``ETHTOOL_GPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_GET`` ``ETHTOOL_SPFLAGS`` ``ETHTOOL_MSG_PRIVFLAGS_SET`` - ``ETHTOOL_GRXFH`` n/a - ``ETHTOOL_SRXFH`` n/a + ``ETHTOOL_GRXFH`` ``ETHTOOL_MSG_RSS_GET`` + ``ETHTOOL_SRXFH`` ``ETHTOOL_MSG_RSS_SET`` ``ETHTOOL_GGRO`` ``ETHTOOL_MSG_FEATURES_GET`` ``ETHTOOL_SGRO`` ``ETHTOOL_MSG_FEATURES_SET`` ``ETHTOOL_GRXRINGS`` n/a @@ -2401,8 +2520,8 @@ are netlink only. 
``ETHTOOL_SRXNTUPLE`` n/a ``ETHTOOL_GRXNTUPLE`` n/a ``ETHTOOL_GSSET_INFO`` ``ETHTOOL_MSG_STRSET_GET`` - ``ETHTOOL_GRXFHINDIR`` n/a - ``ETHTOOL_SRXFHINDIR`` n/a + ``ETHTOOL_GRXFHINDIR`` ``ETHTOOL_MSG_RSS_GET`` + ``ETHTOOL_SRXFHINDIR`` ``ETHTOOL_MSG_RSS_SET`` ``ETHTOOL_GFEATURES`` ``ETHTOOL_MSG_FEATURES_GET`` ``ETHTOOL_SFEATURES`` ``ETHTOOL_MSG_FEATURES_SET`` ``ETHTOOL_GCHANNELS`` ``ETHTOOL_MSG_CHANNELS_GET`` diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 0f1251cce314..bb620f554598 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -8,15 +8,19 @@ IP Sysctl ============================== ip_forward - BOOLEAN - - 0 - disabled (default) - - not 0 - enabled - Forward Packets between interfaces. This variable is special, its change resets all configuration parameters to their default state (RFC1122 for hosts, RFC1812 for routers) + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + ip_default_ttl - INTEGER Default value of TTL field (Time To Live) for outgoing (but not forwarded) IP packets. Should be between 1 and 255 inclusive. @@ -62,20 +66,25 @@ ip_forward_use_pmtu - BOOLEAN kernel honoring this information. This is normally not the case. - Default: 0 (disabled) - Possible values: - - 0 - disabled - - 1 - enabled + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) fwmark_reflect - BOOLEAN Controls the fwmark of kernel-generated IPv4 reply packets that are not associated with a socket for example, TCP RSTs or ICMP echo replies). - If unset, these packets have a fwmark of zero. If set, they have the + If disabled, these packets have a fwmark of zero. If enabled, they have the fwmark of the packet they are replying to. 
- Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) fib_multipath_use_neigh - BOOLEAN Use status of existing neighbor entry when determining nexthop for @@ -83,12 +92,12 @@ fib_multipath_use_neigh - BOOLEAN packets could be directed to a failed nexthop. Only valid for kernels built with CONFIG_IP_ROUTE_MULTIPATH enabled. - Default: 0 (disabled) - Possible values: - - 0 - disabled - - 1 - enabled + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) fib_multipath_hash_policy - INTEGER Controls which hash policy to use for multipath routes. Only valid @@ -368,7 +377,12 @@ tcp_autocorking - BOOLEAN queue. Applications can still use TCP_CORK for optimal behavior when they know how/when to uncork their sockets. - Default : 1 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) tcp_available_congestion_control - STRING Shows the available congestion control choices that are registered. @@ -408,9 +422,16 @@ tcp_congestion_control - STRING tcp_dsack - BOOLEAN Allows TCP to send "duplicate" SACKs. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) + tcp_early_retrans - INTEGER Tail loss probe (TLP) converts RTOs occurring due to tail - losses into fast recovery (draft-ietf-tcpm-rack). Note that + losses into fast recovery (RFC8985). Note that TLP requires RACK to function properly (see tcp_recovery below) Possible values: @@ -447,7 +468,12 @@ tcp_ecn_fallback - BOOLEAN knob. The value is not used, if tcp_ecn or per route (or congestion control) ECN settings are disabled. - Default: 1 (fallback enabled) + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) tcp_fack - BOOLEAN This is a legacy option, it has no effect anymore. @@ -474,7 +500,7 @@ tcp_frto - INTEGER By default it's enabled with a non-zero value. 0 disables F-RTO. 
tcp_fwmark_accept - BOOLEAN - If set, incoming connections to listening sockets that do not have a + If enabled, incoming connections to listening sockets that do not have a socket mark will set the mark of the accepting socket to the fwmark of the incoming SYN packet. This will cause all packets on that connection (starting from the first SYNACK) to be sent with that fwmark. The @@ -482,7 +508,12 @@ tcp_fwmark_accept - BOOLEAN have a fwmark set via setsockopt(SOL_SOCKET, SO_MARK, ...) are unaffected. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) tcp_invalid_ratelimit - INTEGER Limit the maximal rate for sending duplicate acknowledgments @@ -528,6 +559,11 @@ tcp_l3mdev_accept - BOOLEAN which the packets originated. Only valid when the kernel was compiled with CONFIG_NET_L3_MASTER_DEV. + Possible values: + + - 0 (disabled) + - 1 (enabled) + Default: 0 (disabled) tcp_low_latency - BOOLEAN @@ -593,10 +629,16 @@ tcp_min_rtt_wlen - INTEGER Default: 300 tcp_moderate_rcvbuf - BOOLEAN - If set, TCP performs receive buffer auto-tuning, attempting to + If enabled, TCP performs receive buffer auto-tuning, attempting to automatically size the buffer (no greater than tcp_rmem[2]) to - match the size required by the path for full throughput. Enabled by - default. + match the size required by the path for full throughput. + + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) tcp_mtu_probing - INTEGER Controls TCP Packetization-Layer Path MTU Discovery. Takes three @@ -621,13 +663,26 @@ tcp_no_metrics_save - BOOLEAN when the connection closes, so that connections established in the near future can use these to set initial conditions. Usually, this increases overall performance, but may sometimes cause performance - degradation. If set, TCP will not cache metrics on closing + degradation. If enabled, TCP will not cache metrics on closing connections. 
+ Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + tcp_no_ssthresh_metrics_save - BOOLEAN Controls whether TCP saves ssthresh metrics in the route cache. + If enabled, ssthresh metrics are disabled. + + Possible values: + + - 0 (disabled) + - 1 (enabled) - Default is 1, which disables ssthresh metrics. + Default: 1 (enabled) tcp_orphan_retries - INTEGER This value influences the timeout of a locally closed TCP connection, @@ -645,9 +700,11 @@ tcp_recovery - INTEGER features. ========= ============================================================= - RACK: 0x1 enables the RACK loss detection for fast detection of lost - retransmissions and tail drops. It also subsumes and disables - RFC6675 recovery for SACK connections. + RACK: 0x1 enables RACK loss detection, for fast detection of lost + retransmissions and tail drops, and resilience to + reordering. currently, setting this bit to 0 has no + effect, since RACK is the only supported loss detection + algorithm. RACK: 0x2 makes RACK's reordering window static (min_rtt/4). @@ -664,6 +721,11 @@ tcp_reflect_tos - BOOLEAN This options affects both IPv4 and IPv6. + Possible values: + + - 0 (disabled) + - 1 (enabled) + Default: 0 (disabled) tcp_reordering - INTEGER @@ -685,6 +747,13 @@ tcp_retrans_collapse - BOOLEAN On retransmit try to send bigger packets to work around bugs in certain TCP stacks. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) + tcp_retries1 - INTEGER This value influences the time, after which TCP decides, that something is wrong due to unacknowledged RTO retransmissions, @@ -712,11 +781,16 @@ tcp_retries2 - INTEGER which corresponds to a value of at least 8. tcp_rfc1337 - BOOLEAN - If set, the TCP stack behaves conforming to RFC1337. If unset, + If enabled, the TCP stack behaves conforming to RFC1337. If unset, we are not conforming to RFC, but prevent TCP TIME_WAIT assassination. 
- Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) tcp_rmem - vector of 3 INTEGERs: min, default, max min: Minimal size of receive buffer used by TCP sockets. @@ -740,6 +814,13 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max tcp_sack - BOOLEAN Enable select acknowledgments (SACKS). + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) + tcp_comp_sack_delay_ns - LONG INTEGER TCP tries to reduce number of SACK sent, using a timer based on 5% of SRTT, capped by this sysctl, in nano seconds. @@ -762,26 +843,41 @@ tcp_comp_sack_nr - INTEGER Default : 44 tcp_backlog_ack_defer - BOOLEAN - If set, user thread processing socket backlog tries sending + If enabled, user thread processing socket backlog tries sending one ACK for the whole queue. This helps to avoid potential long latencies at end of a TCP socket syscall. - Default : true + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) tcp_slow_start_after_idle - BOOLEAN - If set, provide RFC2861 behavior and time out the congestion + If enabled, provide RFC2861 behavior and time out the congestion window after an idle period. An idle period is defined at the current RTO. If unset, the congestion window will not be timed out after an idle period. - Default: 1 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) tcp_stdurg - BOOLEAN Use the Host requirements interpretation of the TCP urgent pointer field. - Most hosts use the older BSD interpretation, so if you turn this on + Most hosts use the older BSD interpretation, so if enabled, Linux might not communicate correctly with them. - Default: FALSE + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) tcp_synack_retries - INTEGER Number of times SYNACKs for a passive TCP connection attempt will @@ -838,7 +934,12 @@ tcp_migrate_req - BOOLEAN migration by returning SK_DROP in the type of eBPF program, or disable this option. 
- Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) tcp_fastopen - INTEGER Enable TCP Fast Open (RFC7413) to send and accept data in the opening @@ -1019,6 +1120,13 @@ tcp_tw_reuse_delay - UNSIGNED INTEGER tcp_window_scaling - BOOLEAN Enable window scaling as defined in RFC1323. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) + tcp_shrink_window - BOOLEAN This changes how the TCP receive window is calculated. @@ -1026,13 +1134,15 @@ tcp_shrink_window - BOOLEAN window can be offered, and that TCP implementations MUST ensure that they handle a shrinking window, as specified in RFC 1122. - - 0 - Disabled. The window is never shrunk. - - 1 - Enabled. The window is shrunk when necessary to remain within - the memory limit set by autotuning (sk_rcvbuf). - This only occurs if a non-zero receive window - scaling factor is also in effect. + Possible values: + + - 0 (disabled) - The window is never shrunk. + - 1 (enabled) - The window is shrunk when necessary to remain within + the memory limit set by autotuning (sk_rcvbuf). + This only occurs if a non-zero receive window + scaling factor is also in effect. - Default: 0 + Default: 0 (disabled) tcp_wmem - vector of 3 INTEGERs: min, default, max min: Amount of memory reserved for send buffers for TCP sockets. @@ -1069,16 +1179,21 @@ tcp_notsent_lowat - UNSIGNED INTEGER Default: UINT_MAX (0xFFFFFFFF) tcp_workaround_signed_windows - BOOLEAN - If set, assume no receipt of a window scaling option means the + If enabled, assume no receipt of a window scaling option means the remote TCP is broken and treats the window as a signed quantity. - If unset, assume the remote TCP is not broken even if we do + If disabled, assume the remote TCP is not broken even if we do not receive a window scaling option from them. 
- Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) tcp_thin_linear_timeouts - BOOLEAN Enable dynamic triggering of linear timeouts for thin streams. - If set, a check is performed upon retransmission by timeout to + If enabled, a check is performed upon retransmission by timeout to determine if the stream is thin (less than 4 packets in flight). As long as the stream is found to be thin, up to 6 linear timeouts may be performed before exponential backoff mode is @@ -1087,7 +1202,12 @@ tcp_thin_linear_timeouts - BOOLEAN For more information on thin streams, see Documentation/networking/tcp-thin.rst - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) tcp_limit_output_bytes - INTEGER Controls TCP Small Queue limit per tcp socket. @@ -1139,7 +1259,7 @@ tcp_child_ehash_entries - INTEGER Default: 0 tcp_plb_enabled - BOOLEAN - If set and the underlying congestion control (e.g. DCTCP) supports + If enabled and the underlying congestion control (e.g. DCTCP) supports and enables PLB feature, TCP PLB (Protective Load Balancing) is enabled. PLB is described in the following paper: https://doi.org/10.1145/3544216.3544226. Based on PLB parameters, @@ -1155,12 +1275,17 @@ tcp_plb_enabled - BOOLEAN by switches to determine next hop. In either case, further host and switch side changes will be needed. - When set, PLB assumes that congestion signal (e.g. ECN) is made + If enabled, PLB assumes that congestion signal (e.g. ECN) is made available and used by congestion control module to estimate a congestion measure (e.g. ce_ratio). PLB needs a congestion measure to make repathing decisions. - Default: FALSE + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) tcp_plb_idle_rehash_rounds - INTEGER Number of consecutive congested rounds (RTT) seen after which @@ -1260,6 +1385,11 @@ udp_l3mdev_accept - BOOLEAN originated. 
Only valid when the kernel was compiled with CONFIG_NET_L3_MASTER_DEV. + Possible values: + + - 0 (disabled) + - 1 (enabled) + Default: 0 (disabled) udp_mem - vector of 3 INTEGERs: min, pressure, max @@ -1320,19 +1450,29 @@ raw_l3mdev_accept - BOOLEAN originated. Only valid when the kernel was compiled with CONFIG_NET_L3_MASTER_DEV. + Possible values: + + - 0 (disabled) + - 1 (enabled) + Default: 1 (enabled) CIPSOv4 Variables ================= cipso_cache_enable - BOOLEAN - If set, enable additions to and lookups from the CIPSO label mapping - cache. If unset, additions are ignored and lookups always result in a + If enabled, enable additions to and lookups from the CIPSO label mapping + cache. If disabled, additions are ignored and lookups always result in a miss. However, regardless of the setting the cache is still invalidated when required when means you can safely toggle this on and off and the cache will always be "safe". - Default: 1 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) cipso_cache_bucket_size - INTEGER The CIPSO label cache consists of a fixed size hash table with each @@ -1350,17 +1490,27 @@ cipso_rbm_optfmt - BOOLEAN This means that when set the CIPSO tag will be padded with empty categories in order to make the packet data 32-bit aligned. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) -cipso_rbm_structvalid - BOOLEAN - If set, do a very strict check of the CIPSO option when - ip_options_compile() is called. If unset, relax the checks done during + Default: 0 (disabled) + +cipso_rbm_strictvalid - BOOLEAN + If enabled, do a very strict check of the CIPSO option when + ip_options_compile() is called. If disabled, relax the checks done during ip_options_compile(). Either way is "safe" as errors are caught else where in the CIPSO processing code but setting this to 0 (False) should result in less work (i.e. 
it should be faster) but could cause problems with other implementations that require strict checking. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) IP Variables ============ @@ -1417,10 +1567,15 @@ ip_unprivileged_port_start - INTEGER Default: 1024 ip_nonlocal_bind - BOOLEAN - If set, allows processes to bind() to non-local IP addresses, + If enabled, allows processes to bind() to non-local IP addresses, which can be quite useful - but may break some applications. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) ip_autobind_reuse - BOOLEAN By default, bind() does not select the ports automatically even if @@ -1429,7 +1584,13 @@ ip_autobind_reuse - BOOLEAN when you use bind()+connect(), but may break some applications. The preferred solution is to use IP_BIND_ADDRESS_NO_PORT and this option should only be set by experts. - Default: 0 + + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) ip_dynaddr - INTEGER If set non-zero, enables support for dynamic addresses. @@ -1447,7 +1608,12 @@ ip_early_demux - BOOLEAN It may add an additional cost for pure routing workloads that reduces overall throughput, in such case you should disable it. - Default: 1 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) ping_group_range - 2 INTEGERS Restrict ICMP_PROTO datagram sockets to users in the group range. @@ -1459,31 +1625,56 @@ ping_group_range - 2 INTEGERS tcp_early_demux - BOOLEAN Enable early demux for established TCP sockets. - Default: 1 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) udp_early_demux - BOOLEAN Enable early demux for connected UDP sockets. Disable this if your system could experience more unconnected load. 
- Default: 1 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) icmp_echo_ignore_all - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO + If enabled, then the kernel will ignore all ICMP ECHO requests sent to it. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) icmp_echo_enable_probe - BOOLEAN - If set to one, then the kernel will respond to RFC 8335 PROBE + If enabled, then the kernel will respond to RFC 8335 PROBE requests sent to it. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) icmp_echo_ignore_broadcasts - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO and + If enabled, then the kernel will ignore all ICMP ECHO and TIMESTAMP requests sent to it via broadcast/multicast. - Default: 1 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) icmp_ratelimit - INTEGER Limit the maximal rates for sending ICMP packets whose type matches @@ -1540,17 +1731,22 @@ icmp_ratemask - INTEGER icmp_ignore_bogus_error_responses - BOOLEAN Some routers violate RFC1122 by sending bogus responses to broadcast frames. Such violations are normally logged via a kernel warning. - If this is set to TRUE, the kernel will not give such warnings, which + If enabled, the kernel will not give such warnings, which will avoid log file clutter. - Default: 1 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) icmp_errors_use_inbound_ifaddr - BOOLEAN - If zero, icmp error messages are sent with the primary address of + If disabled, icmp error messages are sent with the primary address of the exiting interface. - If non-zero, the message will be sent with the primary address of + If enabled, the message will be sent with the primary address of the interface that received the packet that caused the icmp error. This is the behaviour many network administrators will expect from a router. 
And it can make debugging complicated network layouts @@ -1560,7 +1756,12 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN then the primary address of the first non-loopback interface that has one will be used regardless of this setting. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) igmp_max_memberships - INTEGER Change the maximum number of multicast groups we can subscribe to. @@ -1690,10 +1891,10 @@ proxy_arp_pvlan - BOOLEAN This technology is known by different names: - In RFC 3069 it is called VLAN Aggregation. - Cisco and Allied Telesyn call it Private VLAN. - Hewlett-Packard call it Source-Port filtering or port-isolation. - Ericsson call it MAC-Forced Forwarding (RFC Draft). + - In RFC 3069 it is called VLAN Aggregation. + - Cisco and Allied Telesyn call it Private VLAN. + - Hewlett-Packard call it Source-Port filtering or port-isolation. + - Ericsson call it MAC-Forced Forwarding (RFC Draft). proxy_delay - INTEGER Delay proxy response. @@ -1910,8 +2111,12 @@ arp_evict_nocarrier - BOOLEAN between access points on the same network. In most cases this should remain as the default (1). 
- - 1 - (default): Clear the ARP cache on NOCARRIER events - - 0 - Do not clear ARP cache on NOCARRIER events + Possible values: + + - 0 (disabled) - Do not clear ARP cache on NOCARRIER events + - 1 (enabled) - Clear the ARP cache on NOCARRIER events + + Default: 1 (enabled) mcast_solicit - INTEGER The maximum number of multicast probes in INCOMPLETE state, @@ -1934,9 +2139,23 @@ mcast_resolicit - INTEGER disable_policy - BOOLEAN Disable IPSEC policy (SPD) for this interface + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + disable_xfrm - BOOLEAN Disable IPSEC encryption on this interface, whatever the policy + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + igmpv2_unsolicited_report_interval - INTEGER The interval in milliseconds in which the next unsolicited IGMPv1 or IGMPv2 report retransmit will take place. @@ -1952,11 +2171,25 @@ igmpv3_unsolicited_report_interval - INTEGER ignore_routes_with_linkdown - BOOLEAN Ignore routes whose link is down when performing a FIB lookup. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + promote_secondaries - BOOLEAN When a primary IP address is removed from this interface promote a corresponding secondary IP address instead of removing all the corresponding secondary IP addresses. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + drop_unicast_in_l2_multicast - BOOLEAN Drop any unicast IP packets that are received in link-layer multicast (or broadcast) frames. @@ -1964,14 +2197,24 @@ drop_unicast_in_l2_multicast - BOOLEAN This behavior (for multicast) is actually a SHOULD in RFC 1122, but is disabled by default for compatibility reasons. 
- Default: off (0) + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) drop_gratuitous_arp - BOOLEAN Drop all gratuitous ARP frames, for example if there's a known good ARP proxy on the network and such frames need not be used (or in the case of 802.11, must not be used to prevent attacks.) - Default: off (0) + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) tag - INTEGER @@ -2015,20 +2258,24 @@ bindv6only - BOOLEAN which restricts use of the IPv6 socket to IPv6 communication only. - - TRUE: disable IPv4-mapped address feature - - FALSE: enable IPv4-mapped address feature + Possible values: + + - 0 (disabled) - enable IPv4-mapped address feature + - 1 (enabled) - disable IPv4-mapped address feature - Default: FALSE (as specified in RFC3493) + Default: 0 (disabled) flowlabel_consistency - BOOLEAN Protect the consistency (and unicity) of flow label. You have to disable it to use IPV6_FL_F_REFLECT flag on the flow label manager. - - TRUE: enabled - - FALSE: disabled + Possible values: - Default: TRUE + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) auto_flowlabels - INTEGER Automatically generate flow labels based on a flow hash of the @@ -2054,10 +2301,13 @@ flowlabel_state_ranges - BOOLEAN reserved for the IPv6 flow manager facility, 0x80000-0xFFFFF is reserved for stateless flow labels as described in RFC6437. - - TRUE: enabled - - FALSE: disabled + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) - Default: true flowlabel_reflect - INTEGER Control flow label reflection. 
Needed for Path MTU @@ -2125,10 +2375,13 @@ anycast_src_echo_reply - BOOLEAN Controls the use of anycast addresses as source addresses for ICMPv6 echo reply - - TRUE: enabled - - FALSE: disabled + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) - Default: FALSE idgen_delay - INTEGER Controls the delay in seconds after which time to retry @@ -2185,7 +2438,12 @@ skip_notify_on_dev_down - BOOLEAN to true skips the message, making IPv4 and IPv6 on par in relying on userspace caches to track link events and evict routes. - Default: false (generate message) + Possible values: + + - 0 (disabled) - generate the message + - 1 (enabled) - skip generating the message + + Default: 0 (disabled) nexthop_compat_mode - BOOLEAN New nexthop API provides a means for managing nexthops independent of @@ -2229,8 +2487,10 @@ fib_notify_on_flag_change - INTEGER ioam6_id - INTEGER Define the IOAM id of this node. Uses only 24 bits out of 32 in total. - Min: 0 - Max: 0xFFFFFF + Possible value range: + + - Min: 0 + - Max: 0xFFFFFF Default: 0xFFFFFF @@ -2238,8 +2498,10 @@ ioam6_id_wide - LONG INTEGER Define the wide IOAM id of this node. Uses only 56 bits out of 64 in total. Can be different from ioam6_id. - Min: 0 - Max: 0xFFFFFFFFFFFFFF + Possible value range: + + - Min: 0 + - Max: 0xFFFFFFFFFFFFFF Default: 0xFFFFFFFFFFFFFF @@ -2281,8 +2543,8 @@ conf/all/disable_ipv6 - BOOLEAN conf/all/forwarding - BOOLEAN Enable global IPv6 forwarding between all interfaces. - IPv4 and IPv6 work differently here; e.g. netfilter must be used - to control which interfaces may forward packets and which not. + IPv4 and IPv6 work differently here; the ``force_forwarding`` flag must + be used to control which interfaces may forward packets. This also sets all interfaces' Host/Router setting 'forwarding' to the specified value. See below for details. @@ -2292,13 +2554,30 @@ conf/all/forwarding - BOOLEAN proxy_ndp - BOOLEAN Do proxy ndp. 
+ Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + +force_forwarding - BOOLEAN + Enable forwarding on this interface only -- regardless of the setting on + ``conf/all/forwarding``. When setting ``conf.all.forwarding`` to 0, + the ``force_forwarding`` flag will be reset on all interfaces. + fwmark_reflect - BOOLEAN Controls the fwmark of kernel-generated IPv6 reply packets that are not associated with a socket for example, TCP RSTs or ICMPv6 echo replies). - If unset, these packets have a fwmark of zero. If set, they have the + If disabled, these packets have a fwmark of zero. If enabled, they have the fwmark of the packet they are replying to. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) ``conf/interface/*``: Change special settings per interface. @@ -2389,9 +2668,11 @@ ra_honor_pio_life - BOOLEAN lifetime of an address matching a prefix sent in a Router Advertisement Prefix Information Option. - - If enabled, the PIO valid lifetime will always be honored. - - If disabled, RFC4862 section 5.5.3e is used to determine + Possible values: + + - 0 (disabled) - RFC4862 section 5.5.3e is used to determine the valid lifetime of the address. + - 1 (enabled) - the PIO valid lifetime will always be honored. Default: 0 (disabled) @@ -2403,8 +2684,10 @@ ra_honor_pio_pflag - BOOLEAN P-flag suppresses any effects of the A-flag within the same PIO. For a given PIO, P=1 and A=1 is treated as A=0. - - If disabled, the P-flag is ignored. - - If enabled, the P-flag will disable SLAAC autoconfiguration + Possible values: + + - 0 (disabled) - the P-flag is ignored. + - 1 (enabled) - the P-flag will disable SLAAC autoconfiguration for the given Prefix Information Option. 
Default: 0 (disabled) @@ -2526,10 +2809,15 @@ mtu - INTEGER Default: 1280 (IPv6 required minimum) ip_nonlocal_bind - BOOLEAN - If set, allows processes to bind() to non-local IPv6 addresses, + If enabled, allows processes to bind() to non-local IPv6 addresses, which can be quite useful - but may break some applications. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) router_probe_interval - INTEGER Minimum interval (in seconds) between Router Probing described @@ -2559,7 +2847,12 @@ use_oif_addrs_only - BOOLEAN routed via this interface are restricted to the set of addresses configured on this interface (vis. RFC 6724, section 4). - Default: false + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) use_tempaddr - INTEGER Preference for Privacy Extensions (RFC3041). @@ -2684,10 +2977,14 @@ force_tllao - BOOLEAN ndisc_notify - BOOLEAN Define mode for notification of address and device changes. - * 0 - (default): do nothing - * 1 - Generate unsolicited neighbour advertisements when device is brought + Possible values: + + - 0 (disabled) - do nothing + - 1 (enabled) - Generate unsolicited neighbour advertisements when device is brought up or hardware address changes. + Default: 0 (disabled) + ndisc_tclass - INTEGER The IPv6 Traffic Class to use by default when sending IPv6 Neighbor Discovery (Router Solicitation, Router Advertisement, Neighbor @@ -2704,8 +3001,12 @@ ndisc_evict_nocarrier - BOOLEAN not be cleared when roaming between access points on the same network. In most cases this should remain as the default (1). - - 1 - (default): Clear neighbor discover cache on NOCARRIER events. - - 0 - Do not clear neighbor discovery cache on NOCARRIER events. + Possible values: + + - 0 (disabled) - Do not clear neighbor discovery cache on NOCARRIER events. + - 1 (enabled) - Clear neighbor discover cache on NOCARRIER events. 
+ + Default: 1 (enabled) mldv1_unsolicited_report_interval - INTEGER The interval in milliseconds in which the next unsolicited @@ -2734,25 +3035,34 @@ suppress_frag_ndisc - INTEGER optimistic_dad - BOOLEAN Whether to perform Optimistic Duplicate Address Detection (RFC 4429). - * 0: disabled (default) - * 1: enabled - Optimistic Duplicate Address Detection for the interface will be enabled if at least one of conf/{all,interface}/optimistic_dad is set to 1, it will be disabled otherwise. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + + use_optimistic - BOOLEAN If enabled, do not classify optimistic addresses as deprecated during source address selection. Preferred addresses will still be chosen before optimistic addresses, subject to other ranking in the source address selection algorithm. - * 0: disabled (default) - * 1: enabled - This will be enabled if at least one of conf/{all,interface}/use_optimistic is set to 1, disabled otherwise. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) + stable_secret - IPv6 address This IPv6 address will be used as a secret to generate IPv6 addresses for link-local addresses and autoconfigured @@ -2783,14 +3093,24 @@ drop_unicast_in_l2_multicast - BOOLEAN Drop any unicast IPv6 packets that are received in link-layer multicast (or broadcast) frames. - By default this is turned off. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) drop_unsolicited_na - BOOLEAN Drop all unsolicited neighbor advertisements, for example if there's a known good NA proxy on the network and such frames need not be used (or in the case of 802.11, must not be used to prevent attacks.) - By default this is turned off. + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled). 
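The ``optimistic_dad`` and ``use_optimistic`` entries above both state that the feature is enabled for an interface if at least one of conf/{all,interface} is set to 1. A minimal sketch of that rule, using a hypothetical helper name (`effective_flag` is not a kernel interface) and assuming both values have already been read as integers:

```python
def effective_flag(conf_all, conf_interface):
    """Combine conf/all/<knob> and conf/<interface>/<knob> values.

    Returns 1 (enabled) if at least one of the two is set to 1,
    otherwise 0 (disabled), as described for optimistic_dad.
    """
    return 1 if (conf_all == 1 or conf_interface == 1) else 0


for all_value, if_value in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(all_value, if_value, effective_flag(all_value, if_value))
```

Only the (0, 0) combination leaves the feature disabled. Other knobs combine the two values differently, so this sketch should not be generalized beyond the two entries it names.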
accept_untracked_na - INTEGER Define behavior for accepting neighbor advertisements from devices that @@ -2831,7 +3151,12 @@ enhanced_dad - BOOLEAN The nonce option will be sent on an interface unless both of conf/{all,interface}/enhanced_dad are set to FALSE. - Default: TRUE + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 1 (enabled) ``icmp/*``: =========== @@ -2860,29 +3185,49 @@ ratemask - list of comma separated ranges Default: 0-1,3-127 (rate limit ICMPv6 errors except Packet Too Big) echo_ignore_all - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO + If enabled, then the kernel will ignore all ICMP ECHO requests sent to it over the IPv6 protocol. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) echo_ignore_multicast - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO + If enabled, then the kernel will ignore all ICMP ECHO requests sent to it over the IPv6 protocol via multicast. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) echo_ignore_anycast - BOOLEAN - If set non-zero, then the kernel will ignore all ICMP ECHO + If enabled, then the kernel will ignore all ICMP ECHO requests sent to it over the IPv6 protocol destined to anycast address. - Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) error_anycast_as_unicast - BOOLEAN - If set to 1, then the kernel will respond with ICMP Errors + If enabled, then the kernel will respond with ICMP Errors resulting from requests sent to it over the IPv6 protocol destined to anycast address essentially treating anycast as unicast. 
- Default: 0 + Possible values: + + - 0 (disabled) + - 1 (enabled) + + Default: 0 (disabled) xfrm6_gc_thresh - INTEGER (Obsolete since linux-4.14) @@ -2900,34 +3245,49 @@ YOSHIFUJI Hideaki / USAGI Project <yoshfuji@linux-ipv6.org> ================================= bridge-nf-call-arptables - BOOLEAN - - 1 : pass bridged ARP traffic to arptables' FORWARD chain. - - 0 : disable this. - Default: 1 + Possible values: + + - 0 (disabled) - disable this. + - 1 (enabled) - pass bridged ARP traffic to arptables' FORWARD chain. + + Default: 1 (enabled) bridge-nf-call-iptables - BOOLEAN - - 1 : pass bridged IPv4 traffic to iptables' chains. - - 0 : disable this. - Default: 1 + Possible values: + + - 0 (disabled) - disable this. + - 1 (enabled) - pass bridged IPv4 traffic to iptables' chains. + + Default: 1 (enabled) bridge-nf-call-ip6tables - BOOLEAN - - 1 : pass bridged IPv6 traffic to ip6tables' chains. - - 0 : disable this. - Default: 1 + Possible values: + + - 0 (disabled) - disable this. + - 1 (enabled) - pass bridged IPv6 traffic to ip6tables' chains. + + Default: 1 (enabled) bridge-nf-filter-vlan-tagged - BOOLEAN - - 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables. - - 0 : disable this. - Default: 0 + Possible values: + + - 0 (disabled) - disable this. + - 1 (enabled) - pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables + + Default: 0 (disabled) bridge-nf-filter-pppoe-tagged - BOOLEAN - - 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. - - 0 : disable this. - Default: 0 + Possible values: + + - 0 (disabled) - disable this. + - 1 (enabled) - pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. + + Default: 0 (disabled) bridge-nf-pass-vlan-input-dev - BOOLEAN - 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan @@ -2950,11 +3310,12 @@ addip_enable - BOOLEAN the ability to dynamically add and remove new addresses for the SCTP associations. - 1: Enable extension. 
+ Possible values: - 0: Disable extension. + - 0 (disabled) - disable extension. + - 1 (enabled) - enable extension - Default: 0 + Default: 0 (disabled) pf_enable - INTEGER Enable or disable pf (pf is short for potentially failed) state. A value @@ -2969,31 +3330,27 @@ pf_enable - INTEGER https://datatracker.ietf.org/doc/draft-ietf-tsvwg-sctp-failover for details. - 1: Enable pf. + Possible values: - 0: Disable pf. + - 1: Enable pf. + - 0: Disable pf. Default: 1 pf_expose - INTEGER Unset or enable/disable pf (pf is short for potentially failed) state exposure. Applications can control the exposure of the PF path state - in the SCTP_PEER_ADDR_CHANGE event and the SCTP_GET_PEER_ADDR_INFO - sockopt. When it's unset, no SCTP_PEER_ADDR_CHANGE event with - SCTP_ADDR_PF state will be sent and a SCTP_PF-state transport info - can be got via SCTP_GET_PEER_ADDR_INFO sockopt; When it's enabled, - a SCTP_PEER_ADDR_CHANGE event will be sent for a transport becoming - SCTP_PF state and a SCTP_PF-state transport info can be got via - SCTP_GET_PEER_ADDR_INFO sockopt; When it's disabled, no - SCTP_PEER_ADDR_CHANGE event will be sent and it returns -EACCES when - trying to get a SCTP_PF-state transport info via SCTP_GET_PEER_ADDR_INFO - sockopt. - - 0: Unset pf state exposure, Compatible with old applications. + in the SCTP_PEER_ADDR_CHANGE event and access of SCTP_PF-state + transport info via SCTP_GET_PEER_ADDR_INFO sockopt. - 1: Disable pf state exposure. + Possible values: - 2: Enable pf state exposure. + - 0: Unset pf state exposure (compatible with old applications). No + event will be sent but the transport info can be queried. + - 1: Disable pf state exposure. No event will be sent and trying to + obtain transport info will return -EACCESS. + - 2: Enable pf state exposure. The event will be sent for a transport + becoming SCTP_PF state and transport info can be obtained. 
Default: 0 @@ -3023,19 +3380,23 @@ auth_enable - BOOLEAN required for secure operation of Dynamic Address Reconfiguration (ADD-IP) extension. - - 1: Enable this extension. - - 0: Disable this extension. + Possible values: - Default: 0 + - 0 (disabled) - disable extension. + - 1 (enabled) - enable extension + + Default: 0 (disabled) prsctp_enable - BOOLEAN Enable or disable the Partial Reliability extension (RFC3758) which is used to notify peers that a given DATA should no longer be expected. - - 1: Enable extension - - 0: Disable + Possible values: - Default: 1 + - 0 (disabled) - disable extension. + - 1 (enabled) - enable extension + + Default: 1 (enabled) max_burst - INTEGER The limit of the number of new packets that can be initially sent. It @@ -3135,10 +3496,12 @@ cookie_preserve_enable - BOOLEAN Enable or disable the ability to extend the lifetime of the SCTP cookie that is used during the establishment phase of SCTP association - - 1: Enable cookie lifetime extension. - - 0: Disable + Possible values: + + - 0 (disabled) - disable. + - 1 (enabled) - enable cookie lifetime extension. - Default: 1 + Default: 1 (enabled) cookie_hmac_alg - STRING Select the hmac algorithm used when generating the cookie value sent by @@ -3183,13 +3546,11 @@ sndbuf_policy - INTEGER sctp_mem - vector of 3 INTEGERs: min, pressure, max Number of pages allowed for queueing by all SCTP sockets. - min: Below this number of pages SCTP is not bothered about its - memory appetite. When amount of memory allocated by SCTP exceeds - this number, SCTP starts to moderate memory usage. - - pressure: This value was introduced to follow format of tcp_mem. - - max: Number of pages allowed for queueing by all SCTP sockets. + * min: Below this number of pages SCTP is not bothered about its + memory usage. When amount of memory allocated by SCTP exceeds + this number, SCTP starts to moderate memory usage. + * pressure: This value was introduced to follow format of tcp_mem. 
+ * max: Maximum number of allowed pages. Default is calculated at boot time from amount of available memory. @@ -3197,9 +3558,9 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max Only the first value ("min") is used, "default" and "max" are ignored. - min: Minimal size of receive buffer used by SCTP socket. - It is guaranteed to each SCTP socket (but not association) even - under moderate memory pressure. + * min: Minimal size of receive buffer used by SCTP socket. + It is guaranteed to each SCTP socket (but not association) even + under moderate memory pressure. Default: 4K @@ -3207,14 +3568,16 @@ sctp_wmem - vector of 3 INTEGERs: min, default, max Only the first value ("min") is used, "default" and "max" are ignored. - min: Minimum size of send buffer that can be used by SCTP sockets. - It is guaranteed to each SCTP socket (but not association) even - under moderate memory pressure. + * min: Minimum size of send buffer that can be used by SCTP sockets. + It is guaranteed to each SCTP socket (but not association) even + under moderate memory pressure. Default: 4K addr_scope_policy - INTEGER - Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 + Control IPv4 address scoping (see + https://datatracker.ietf.org/doc/draft-stewart-tsvwg-sctp-ipv4/00/ + for details). - 0 - Disable IPv4 address scoping - 1 - Enable IPv4 address scoping @@ -3272,10 +3635,12 @@ reconf_enable - BOOLEAN a stream, and it includes the Parameters of "Outgoing/Incoming SSN Reset", "SSN/TSN Reset" and "Add Outgoing/Incoming Streams". - - 1: Enable extension. - - 0: Disable extension. + Possible values: - Default: 0 + - 0 (disabled) - Disable extension. + - 1 (enabled) - Enable extension. + + Default: 0 (disabled) intl_enable - BOOLEAN Enable or disable extension of User Message Interleaving functionality @@ -3286,10 +3651,12 @@ intl_enable - BOOLEAN to 1 and also needs to set socket options SCTP_FRAGMENT_INTERLEAVE to 2 and SCTP_INTERLEAVING_SUPPORTED to 1. 
- - 1: Enable extension. - - 0: Disable extension. + Possible values: - Default: 0 + - 0 (disabled) - Disable extension. + - 1 (enabled) - Enable extension. + + Default: 0 (disabled) ecn_enable - BOOLEAN Control use of Explicit Congestion Notification (ECN) by SCTP. @@ -3298,10 +3665,12 @@ ecn_enable - BOOLEAN due to congestion by allowing supporting routers to signal congestion before having to drop packets. - 1: Enable ecn. - 0: Disable ecn. + Possible values: - Default: 1 + - 0 (disabled) - Disable ecn. + - 1 (enabled) - Enable ecn. + + Default: 1 (enabled) l3mdev_accept - BOOLEAN Enabling this option allows a "global" bound socket to work @@ -3310,6 +3679,11 @@ l3mdev_accept - BOOLEAN originated. Only valid when the kernel was compiled with CONFIG_NET_L3_MASTER_DEV. + Possible values: + + - 0 (disabled) + - 1 (enabled) + Default: 1 (enabled) diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst index d0e3953cae6a..a15754adb041 100644 --- a/Documentation/networking/napi.rst +++ b/Documentation/networking/napi.rst @@ -444,7 +444,14 @@ dependent). The NAPI instance IDs will be assigned in the opposite order than the process IDs of the kernel threads. Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in -netdev's sysfs directory. +netdev's sysfs directory. It can also be enabled for a specific NAPI using +netlink interface. + +For example, using the script: + +.. code-block:: bash + + $ ynl --family netdev --do napi-set --json='{"id": 66, "threaded": 1}' .. 
rubric:: Footnotes diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst index c69cc89c958e..1c19bb7705df 100644 --- a/Documentation/networking/net_cachelines/net_device.rst +++ b/Documentation/networking/net_cachelines/net_device.rst @@ -68,6 +68,7 @@ unsigned_char addr_assign_type unsigned_char addr_len unsigned_char upper_level unsigned_char lower_level +u8 threaded napi_poll(napi_enable,netif_set_threaded) unsigned_short neigh_priv_len unsigned_short padded unsigned_short dev_id @@ -165,7 +166,6 @@ struct sfp_bus* sfp_bus struct lock_class_key* qdisc_tx_busylock bool proto_down unsigned:1 wol_enabled -unsigned:1 threaded napi_poll(napi_enable,dev_set_threaded) unsigned_long:1 see_all_hwtstamp_requests unsigned_long:1 change_proto_down unsigned_long:1 netns_immutable diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst index bd44b3eebbef..bce4eb35ec48 100644 --- a/Documentation/networking/net_cachelines/snmp.rst +++ b/Documentation/networking/net_cachelines/snmp.rst @@ -36,6 +36,7 @@ unsigned_long LINUX_MIB_TIMEWAITRECYCLED unsigned_long LINUX_MIB_TIMEWAITKILLED unsigned_long LINUX_MIB_PAWSACTIVEREJECTED unsigned_long LINUX_MIB_PAWSESTABREJECTED +unsigned_long LINUX_MIB_BEYOND_WINDOW unsigned_long LINUX_MIB_TSECR_REJECTED unsigned_long LINUX_MIB_PAWS_OLD_ACK unsigned_long LINUX_MIB_PAWS_TW_REJECTED diff --git a/Documentation/networking/net_cachelines/tcp_sock.rst b/Documentation/networking/net_cachelines/tcp_sock.rst index bc9b2131bf7a..7bbda5944ee2 100644 --- a/Documentation/networking/net_cachelines/tcp_sock.rst +++ b/Documentation/networking/net_cachelines/tcp_sock.rst @@ -115,7 +115,6 @@ u32 lost_out read_mostly read_m u32 sacked_out read_mostly read_mostly tcp_left_out(tx);tcp_packets_in_flight(tx/rx);tcp_clean_rtx_queue(rx) struct hrtimer pacing_timer struct hrtimer compressed_ack_timer -struct sk_buff* lost_skb_hint 
read_mostly tcp_clean_rtx_queue struct sk_buff* retransmit_skb_hint read_mostly tcp_clean_rtx_queue struct rb_root out_of_order_queue read_mostly tcp_data_queue,tcp_fast_path_check struct sk_buff* ooo_last_skb @@ -123,7 +122,6 @@ struct tcp_sack_block[1] duplicate_sack struct tcp_sack_block[4] selective_acks struct tcp_sack_block[4] recv_sack_cache struct sk_buff* highest_sack read_write tcp_event_new_data_sent -int lost_cnt_hint u32 prior_ssthresh u32 high_seq u32 retrans_stamp diff --git a/Documentation/networking/netconsole.rst b/Documentation/networking/netconsole.rst index a0076b542e9c..59cb9982afe6 100644 --- a/Documentation/networking/netconsole.rst +++ b/Documentation/networking/netconsole.rst @@ -340,6 +340,38 @@ In this example, the message was sent by CPU 42. cpu=42 # kernel-populated value +Message ID auto population in userdata +-------------------------------------- + +Within the netconsole configfs hierarchy, there is a file named `msgid_enabled` +located in the `userdata` directory. This file controls the message ID +auto-population feature, which assigns a numeric id to each message sent to a +given target and appends the ID to userdata dictionary in every message sent. + +The message ID is generated using a per-target 32 bit counter that is +incremented for every message sent to the target. Note that this counter will +eventually wrap around after reaching uint32_t max value, so the message ID is +not globally unique over time. However, it can still be used by the target to +detect if messages were dropped before reaching the target by identifying gaps +in the sequence of IDs. + +It is important to distinguish message IDs from the message <sequnum> field. +Some kernel messages may never reach netconsole (for example, due to printk +rate limiting). Thus, a gap in <sequnum> cannot be solely relied upon to +indicate that a message was dropped during transmission, as it may never have +been sent via netconsole. 
The message ID, on the other hand, is only assigned +to messages that are actually transmitted via netconsole. + +Example:: + + echo "This is message #1" > /dev/kmsg + echo "This is message #2" > /dev/kmsg + 13,434,54928466,-;This is message #1 + msgid=1 + 13,435,54934019,-;This is message #2 + msgid=2 + + Extended console: ================= diff --git a/Documentation/networking/nf_conntrack-sysctl.rst b/Documentation/networking/nf_conntrack-sysctl.rst index 238b66d0e059..35f889259fcd 100644 --- a/Documentation/networking/nf_conntrack-sysctl.rst +++ b/Documentation/networking/nf_conntrack-sysctl.rst @@ -85,7 +85,6 @@ nf_conntrack_log_invalid - INTEGER - 1 - log ICMP packets - 6 - log TCP packets - 17 - log UDP packets - - 33 - log DCCP packets - 41 - log ICMPv6 packets - 136 - log UDPLITE packets - 255 - log packets of any protocol diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index f64641417c54..7f159043ad5a 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -333,6 +333,13 @@ Some of the interface modes are described below: SerDes lane, each port having speeds of 2.5G / 1G / 100M / 10M achieved through symbol replication. The PCS expects the standard USXGMII code word. +``PHY_INTERFACE_MODE_MIILITE`` + Non-standard, simplified MII mode, without TXER, RXER, CRS and COL signals + as defined for the MII. The absence of COL signal makes half-duplex link + modes impossible but does not interfere with BroadR-Reach link modes on + Broadcom (and other two-wire Ethernet) PHYs, because they are full-duplex + only. 
+ Pause frames / flow control =========================== diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst index a6e0ece18be5..ce96f4c99505 100644 --- a/Documentation/networking/xdp-rx-metadata.rst +++ b/Documentation/networking/xdp-rx-metadata.rst @@ -120,6 +120,39 @@ It is possible to query which kfunc the particular netdev implements via netlink. See ``xdp-rx-metadata-features`` attribute set in ``Documentation/netlink/specs/netdev.yaml``. +Driver Implementation +===================== + +Certain devices may prepend metadata to received packets. However, as of now, +``AF_XDP`` lacks the ability to communicate the size of the ``data_meta`` area +to the consumer. Therefore, it is the responsibility of the driver to copy any +device-reserved metadata out from the metadata area and ensure that +``xdp_buff->data_meta`` is pointing to ``xdp_buff->data`` before presenting the +frame to the XDP program. This is necessary so that, after the XDP program +adjusts the metadata area, the consumer can reliably retrieve the metadata +address using ``METADATA_SIZE`` offset. + +The following diagram shows how custom metadata is positioned relative to the +packet data and how pointers are adjusted for metadata access:: + + |<-- bpf_xdp_adjust_meta(xdp_buff, -METADATA_SIZE) --| + new xdp_buff->data_meta old xdp_buff->data_meta + | | + | xdp_buff->data + | | + +----------+----------------------------------------------------+------+ + | headroom | custom metadata | data | + +----------+----------------------------------------------------+------+ + | | + | xdp_desc->addr + |<------ xsk_umem__get_data() - METADATA_SIZE -------| + +``bpf_xdp_adjust_meta`` ensures that ``METADATA_SIZE`` is aligned to 4 bytes, +does not exceed 252 bytes, and leaves sufficient space for building the +xdp_frame. If these conditions are not met, it returns a negative error. 
In this +case, the BPF program should not proceed to populate data into the ``data_meta`` +area. + Example =======