diff options
Diffstat (limited to 'Documentation/networking/netdevices.rst')
| -rw-r--r-- | Documentation/networking/netdevices.rst | 418 |
1 files changed, 418 insertions, 0 deletions
diff --git a/Documentation/networking/netdevices.rst b/Documentation/networking/netdevices.rst new file mode 100644 index 000000000000..7ebb6c36482d --- /dev/null +++ b/Documentation/networking/netdevices.rst @@ -0,0 +1,418 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +Network Devices, the Kernel, and You! +===================================== + + +Introduction +============ +The following is a random collection of documentation regarding +network devices. It is intended for driver developers. + +struct net_device lifetime rules +================================ +Network device structures need to persist even after module is unloaded and +must be allocated with alloc_netdev_mqs() and friends. +If device has registered successfully, it will be freed on last use +by free_netdev(). This is required to handle the pathological case cleanly +(example: ``rmmod mydriver </sys/class/net/myeth/mtu``) + +alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver +private data which gets freed when the network device is freed. If +separately allocated data is attached to the network device +(netdev_priv()) then it is up to the module exit handler to free that. + +There are two groups of APIs for registering struct net_device. +First group can be used in normal contexts where ``rtnl_lock`` is not already +held: register_netdev(), unregister_netdev(). +Second group can be used when ``rtnl_lock`` is already held: +register_netdevice(), unregister_netdevice(), free_netdevice(). + +Simple drivers +-------------- + +Most drivers (especially device drivers) handle lifetime of struct net_device +in context where ``rtnl_lock`` is not held (e.g. driver probe and remove paths). + +In that case the struct net_device registration is done using +the register_netdev(), and unregister_netdev() functions: + +.. code-block:: c + + int probe() + { + struct my_device_priv *priv; + int err; + + dev = alloc_netdev_mqs(...); + if (!dev) + return -ENOMEM; + priv = netdev_priv(dev); + + /* ... do all device setup before calling register_netdev() ... + */ + + err = register_netdev(dev); + if (err) + goto err_undo; + + /* net_device is visible to the user! */ + + err_undo: + /* ... undo the device setup ... */ + free_netdev(dev); + return err; + } + + void remove() + { + unregister_netdev(dev); + free_netdev(dev); + } + +Note that after calling register_netdev() the device is visible in the system. +Users can open it and start sending / receiving traffic immediately, +or run any other callback, so all initialization must be done prior to +registration. + +unregister_netdev() closes the device and waits for all users to be done +with it. The memory of struct net_device itself may still be referenced +by sysfs but all operations on that device will fail. + +free_netdev() can be called after unregister_netdev() returns on when +register_netdev() failed. + +Device management under RTNL +---------------------------- + +Registering struct net_device while in context which already holds +the ``rtnl_lock`` requires extra care. In those scenarios most drivers +will want to make use of struct net_device's ``needs_free_netdev`` +and ``priv_destructor`` members for freeing of state. + +Example flow of netdev handling under ``rtnl_lock``: + +.. code-block:: c + + static void my_setup(struct net_device *dev) + { + dev->needs_free_netdev = true; + } + + static void my_destructor(struct net_device *dev) + { + some_obj_destroy(priv->obj); + some_uninit(priv); + } + + int create_link() + { + struct my_device_priv *priv; + int err; + + ASSERT_RTNL(); + + dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup); + if (!dev) + return -ENOMEM; + priv = netdev_priv(dev); + + /* Implicit constructor */ + err = some_init(priv); + if (err) + goto err_free_dev; + + priv->obj = some_obj_create(); + if (!priv->obj) { + err = -ENOMEM; + goto err_some_uninit; + } + /* End of constructor, set the destructor: */ + dev->priv_destructor = my_destructor; + + err = register_netdevice(dev); + if (err) + /* register_netdevice() calls destructor on failure */ + goto err_free_dev; + + /* If anything fails now unregister_netdevice() (or unregister_netdev()) + * will take care of calling my_destructor and free_netdev(). + */ + + return 0; + + err_some_uninit: + some_uninit(priv); + err_free_dev: + free_netdev(dev); + return err; + } + +If struct net_device.priv_destructor is set it will be called by the core +some time after unregister_netdevice(), it will also be called if +register_netdevice() fails. The callback may be invoked with or without +``rtnl_lock`` held. + +There is no explicit constructor callback, driver "constructs" the private +netdev state after allocating it and before registration. + +Setting struct net_device.needs_free_netdev makes core call free_netdevice() +automatically after unregister_netdevice() when all references to the device +are gone. It only takes effect after a successful call to register_netdevice() +so if register_netdevice() fails driver is responsible for calling +free_netdev(). + +free_netdev() is safe to call on error paths right after unregister_netdevice() +or when register_netdevice() fails. Parts of netdev (de)registration process +happen after ``rtnl_lock`` is released, therefore in those cases free_netdev() +will defer some of the processing until ``rtnl_lock`` is released. + +Devices spawned from struct rtnl_link_ops should never free the +struct net_device directly. + +.ndo_init and .ndo_uninit +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device +registration and de-registration, under ``rtnl_lock``. Drivers can use +those e.g. when parts of their init process need to run under ``rtnl_lock``. + +``.ndo_init`` runs before device is visible in the system, ``.ndo_uninit`` +runs during de-registering after device is closed but other subsystems +may still have outstanding references to the netdevice. + +MTU +=== +Each network device has a Maximum Transfer Unit. The MTU does not +include any link layer protocol overhead. Upper layer protocols must +not pass a socket buffer (skb) to a device to transmit with more data +than the mtu. The MTU does not include link layer header overhead, so +for example on Ethernet if the standard MTU is 1500 bytes used, the +actual skb will contain up to 1514 bytes because of the Ethernet +header. Devices should allow for the 4 byte VLAN header as well. + +Segmentation Offload (GSO, TSO) is an exception to this rule. The +upper layer protocol may pass a large socket buffer to the device +transmit routine, and the device will break that up into separate +packets based on the current MTU. + +MTU is symmetrical and applies both to receive and transmit. A device +must be able to receive at least the maximum size packet allowed by +the MTU. A network device may use the MTU as mechanism to size receive +buffers, but the device should allow packets with VLAN header. With +standard Ethernet mtu of 1500 bytes, the device should allow up to +1518 byte packets (1500 + 14 header + 4 tag). The device may either: +drop, truncate, or pass up oversize packets, but dropping oversize +packets is preferred. + + +struct net_device synchronization rules +======================================= +ndo_open: + Synchronization: rtnl_lock() semaphore. In addition, netdev instance + lock if the driver implements queue management or shaper API. + Context: process + +ndo_stop: + Synchronization: rtnl_lock() semaphore. In addition, netdev instance + lock if the driver implements queue management or shaper API. + Context: process + Note: netif_running() is guaranteed false + +ndo_do_ioctl: + Synchronization: rtnl_lock() semaphore. + + This is only called by network subsystems internally, + not by user space calling ioctl as it was in before + linux-5.14. + +ndo_siocbond: + Synchronization: rtnl_lock() semaphore. In addition, netdev instance + lock if the driver implements queue management or shaper API. + Context: process + + Used by the bonding driver for the SIOCBOND family of + ioctl commands. + +ndo_siocwandev: + Synchronization: rtnl_lock() semaphore. In addition, netdev instance + lock if the driver implements queue management or shaper API. + Context: process + + Used by the drivers/net/wan framework to handle + the SIOCWANDEV ioctl with the if_settings structure. + +ndo_siocdevprivate: + Synchronization: rtnl_lock() semaphore. In addition, netdev instance + lock if the driver implements queue management or shaper API. + Context: process + + This is used to implement SIOCDEVPRIVATE ioctl helpers. + These should not be added to new drivers, so don't use. + +ndo_eth_ioctl: + Synchronization: rtnl_lock() semaphore. In addition, netdev instance + lock if the driver implements queue management or shaper API. + Context: process + +ndo_get_stats: + Synchronization: RCU (can be called concurrently with the stats + update path). + Context: atomic (can't sleep under RCU) + +ndo_start_xmit: + Synchronization: __netif_tx_lock spinlock. + + When the driver sets dev->lltx this will be + called without holding netif_tx_lock. In this case the driver + has to lock by itself when needed. + The locking there should also properly protect against + set_rx_mode. WARNING: use of dev->lltx is deprecated. + Don't use it for new drivers. + + Context: Process with BHs disabled or BH (timer), + will be called with interrupts disabled by netconsole. + + Return codes: + + * NETDEV_TX_OK everything ok. + * NETDEV_TX_BUSY Cannot transmit packet, try later + Usually a bug, means queue start/stop flow control is broken in + the driver. Note: the driver must NOT put the skb in its DMA ring. + +ndo_tx_timeout: + Synchronization: netif_tx_lock spinlock; all TX queues frozen. + Context: BHs disabled + Notes: netif_queue_stopped() is guaranteed true + +ndo_set_rx_mode: + Synchronization: netif_addr_lock spinlock. + Context: BHs disabled + +ndo_setup_tc: + ``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks + (i.e. no ``rtnl_lock`` and no device instance lock). The rest of + ``tc_setup_type`` types run under netdev instance lock if the driver + implements queue management or shaper API. + +Most ndo callbacks not specified in the list above are running +under ``rtnl_lock``. In addition, netdev instance lock is taken as well if +the driver implements queue management or shaper API. + +struct napi_struct synchronization rules +======================================== +napi->poll: + Synchronization: + NAPI_STATE_SCHED bit in napi->state. Device + driver's ndo_stop method will invoke napi_disable() on + all NAPI instances which will do a sleeping poll on the + NAPI_STATE_SCHED napi->state bit, waiting for all pending + NAPI activity to cease. + + Context: + softirq + will be called with interrupts disabled by netconsole. + +netdev instance lock +==================== + +Historically, all networking control operations were protected by a single +global lock known as ``rtnl_lock``. There is an ongoing effort to replace this +global lock with separate locks for each network namespace. Additionally, +properties of individual netdev are increasingly protected by per-netdev locks. + +For device drivers that implement shaping or queue management APIs, all control +operations will be performed under the netdev instance lock. +Drivers can also explicitly request instance lock to be held during ops +by setting ``request_ops_lock`` to true. Code comments and docs refer +to drivers which have ops called under the instance lock as "ops locked". +See also the documentation of the ``lock`` member of struct net_device. + +In the future, there will be an option for individual +drivers to opt out of using ``rtnl_lock`` and instead perform their control +operations directly under the netdev instance lock. + +Devices drivers are encouraged to rely on the instance lock where possible. + +For the (mostly software) drivers that need to interact with the core stack, +there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx`` +(e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx`` +functions handle acquiring the instance lock themselves, while the +``netif_xxx`` functions assume that the driver has already acquired +the instance lock. + +struct net_device_ops +--------------------- + +``ndos`` are called without holding the instance lock for most drivers. + +"Ops locked" drivers will have most of the ``ndos`` invoked under +the instance lock. + +struct ethtool_ops +------------------ + +Similarly to ``ndos`` the instance lock is only held for select drivers. +For "ops locked" drivers all ethtool ops without exceptions should +be called under the instance lock. + +struct netdev_stat_ops +---------------------- + +"qstat" ops are invoked under the instance lock for "ops locked" drivers, +and under rtnl_lock for all other drivers. + +struct net_shaper_ops +--------------------- + +All net shaper callbacks are invoked while holding the netdev instance +lock. ``rtnl_lock`` may or may not be held. + +Note that supporting net shapers automatically enables "ops locking". + +struct netdev_queue_mgmt_ops +---------------------------- + +All queue management callbacks are invoked while holding the netdev instance +lock. ``rtnl_lock`` may or may not be held. + +Note that supporting struct netdev_queue_mgmt_ops automatically enables +"ops locking". + +Notifiers and netdev instance lock +---------------------------------- + +For device drivers that implement shaping or queue management APIs, +some of the notifiers (``enum netdev_cmd``) are running under the netdev +instance lock. + +The following netdev notifiers are always run under the instance lock: +* ``NETDEV_XDP_FEAT_CHANGE`` + +For devices with locked ops, currently only the following notifiers are +running under the lock: +* ``NETDEV_CHANGE`` +* ``NETDEV_REGISTER`` +* ``NETDEV_UP`` + +The following notifiers are running without the lock: +* ``NETDEV_UNREGISTER`` + +There are no clear expectations for the remaining notifiers. Notifiers not on +the list may run with or without the instance lock, potentially even invoking +the same notifier type with and without the lock from different code paths. +The goal is to eventually ensure that all (or most, with a few documented +exceptions) notifiers run under the instance lock. Please extend this +documentation whenever you make explicit assumption about lock being held +from a notifier. + +NETDEV_INTERNAL symbol namespace +================================ + +Symbols exported as NETDEV_INTERNAL can only be used in networking +core and drivers which exclusively flow via the main networking list and trees. +Note that the inverse is not true, most symbols outside of NETDEV_INTERNAL +are not expected to be used by random code outside netdev either. +Symbols may lack the designation because they predate the namespaces, +or simply due to an oversight. |
