2021-06-24  gve: Update adminq commands to support DQO queues  (Bailey Forrest)

DQO queue creation requires additional parameters:
- TX completion/RX buffer queue size
- TX completion/RX buffer queue address
- TX/RX queue size
- RX buffer size

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  gve: Add DQO fields for core data structures  (Bailey Forrest)

- Add new DQO datapath structures:
  - `gve_rx_buf_queue_dqo`
  - `gve_rx_compl_queue_dqo`
  - `gve_rx_buf_state_dqo`
  - `gve_tx_desc_dqo`
  - `gve_tx_pending_packet_dqo`
- Incorporate these into the existing ring data structures:
  - `gve_rx_ring`
  - `gve_tx_ring`

Noteworthy mentions:
- `gve_rx_buf_state` represents an RX buffer which was posted to HW. Each RX queue has an array of these objects, and the index into the array is used as the buffer_id when the buffer is posted to HW.
- `gve_tx_pending_packet_dqo` is treated similarly for TX queues. The completion_tag is the index into the array.
- These two structures have links for linked lists which are represented by 16b indexes into a contiguous array of these structures. This reduces memory footprint compared to 64b pointers.
- We use unions for the writeable datapath structures to reduce cache footprint. GQI-specific members will be renamed like DQO members in a future patch.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
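A minimal sketch of the 16b-index linked-list scheme described above, with hypothetical names (the real layout lives in the gve driver headers):

    #include <stdint.h>

    #define INDEX_INVALID 0xFFFF    /* "no element" marker */

    struct pending_packet {         /* stands in for gve_tx_pending_packet_dqo */
        uint16_t next;              /* index of next element, or INDEX_INVALID */
        /* ... per-packet state ... */
    };

    struct index_list {             /* head == tail == INDEX_INVALID when empty */
        uint16_t head;
        uint16_t tail;
    };

    /* Append element `idx` of array `arr` to list `l`. 16-bit links in
     * place of 64-bit pointers save 6 bytes per link per node. */
    static void index_list_push(struct index_list *l,
                                struct pending_packet *arr, uint16_t idx)
    {
        arr[idx].next = INDEX_INVALID;
        if (l->head == INDEX_INVALID)
            l->head = idx;
        else
            arr[l->tail].next = idx;
        l->tail = idx;
    }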
2021-06-24  gve: Add dqo descriptors  (Bailey Forrest)

General description of rings and descriptors:

The TX ring is used for sending TX packet buffers to the NIC. It has the following descriptors:
- `gve_tx_pkt_desc_dqo` - Data buffer descriptor
- `gve_tx_tso_context_desc_dqo` - TSO context descriptor
- `gve_tx_general_context_desc_dqo` - Generic metadata descriptor

Metadata is a collection of 12 bytes. We define `gve_tx_metadata_dqo`, which represents the logical interpretation of the metadata bytes. It's helpful to define this structure because the metadata bytes exist in multiple descriptor types (including `gve_tx_tso_context_desc_dqo`), and the device requires the same field to have the same value in all descriptors.

The TX completion ring is used to receive completions from the NIC. Having a separate ring allows completions to be out of order. The completion descriptor `gve_tx_compl_desc` has several different types; the most important are packet and descriptor completions. Descriptor completions are used to notify the driver when descriptors sent on the TX ring are done being consumed. The descriptor completion is only used to signal that space is cleared in the TX ring. A packet completion is received when a packet transmitted on the TX queue is done being transmitted.

In addition there are "miss" and "reinjection" completions. The device implements a "flow-miss model": most packets will simply receive a packet completion. The flow-miss system may choose to process a packet based on its contents. A TX packet which experiences a flow miss receives a miss completion followed by a later reinjection completion. The miss completion is received when the packet starts to be processed by the flow-miss system, and the reinjection completion is received when the flow-miss system completes processing the packet and sends it on the wire.

The RX buffer ring is used to send buffers to HW via the `gve_rx_desc_dqo` descriptor. Received packets are put into the RX queue by the device, which populates the `gve_rx_compl_desc_dqo` descriptor. The RX descriptors refer to buffers posted by the buffer queue. Received buffers may be returned out of order, such as when HW LRO is enabled.

Important concepts:
- "TX" and "RX buffer" queues, which send descriptors to the device, use MMIO doorbells to notify the device of new descriptors.
- "RX" and "TX completion" queues, which receive descriptors from the device, use a "generation bit" to know when a descriptor was populated by the device. The driver initializes all bits with the "current generation". The device will populate received descriptors with the "next generation", which is inverted from the current generation. When the ring wraps, the current/next generation are swapped.
- It's the driver's responsibility to ensure that the RX and TX completion queues are not overrun. This can be accomplished by limiting the number of descriptors posted to HW.
- TX packets have a 16 bit completion_tag and RX buffers have a 16 bit buffer_id. These will be returned on the TX completion and RX queues respectively to let the driver know which packet/buffer was completed.

Bitfields are used to describe descriptor fields. This notation is more concise and readable than shift-and-mask. It is possible because the driver is restricted to little endian platforms.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
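A minimal sketch of the generation-bit polling described above, with hypothetical names:

    #include <stdbool.h>
    #include <stdint.h>

    struct compl_desc {
        uint8_t gen_and_type;          /* bit 0 holds the generation bit */
        /* ... completion payload ... */
    };

    static bool desc_generation(const struct compl_desc *d)
    {
        return d->gen_and_type & 1;
    }

    struct compl_ring {
        struct compl_desc *descs;      /* initialized with gen == !exp_gen */
        uint32_t size;
        uint32_t head;                 /* next descriptor to examine */
        bool exp_gen;                  /* generation the device writes this pass */
    };

    /* Return the next device-written descriptor, or NULL if none is ready. */
    static struct compl_desc *ring_next_compl(struct compl_ring *r)
    {
        struct compl_desc *d = &r->descs[r->head];

        if (desc_generation(d) != r->exp_gen)
            return NULL;               /* device has not written it yet */

        if (++r->head == r->size) {    /* ring wrapped: generations swap */
            r->head = 0;
            r->exp_gen = !r->exp_gen;
        }
        return d;
    }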
2021-06-24  gve: Add support for DQO RX PTYPE map  (Bailey Forrest)

Unlike GQI, DQO RX descriptors do not contain the L3 and L4 type of the packet. L3 and L4 types are necessary in order to set the hash and csum on RX SKBs correctly.

DQO RX descriptors instead contain a 10 bit PTYPE index. The PTYPE map enables the device to tell the driver how to map from PTYPE index to L3/L4 type.

The device doesn't provide any guarantees about the range of possible PTYPEs, so we just use a 1024 entry array to implement a fast mapping structure.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
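A sketch of the flat mapping table implied above, with hypothetical names:

    #include <stdint.h>

    #define PTYPE_MAP_SIZE 1024        /* 2^10: one entry per 10-bit PTYPE */

    enum l3_type { L3_UNKNOWN, L3_IPV4, L3_IPV6 };
    enum l4_type { L4_UNKNOWN, L4_TCP, L4_UDP, L4_SCTP, L4_ICMP };

    struct ptype_entry {
        uint8_t l3;                    /* enum l3_type */
        uint8_t l4;                    /* enum l4_type */
    };

    /* Filled once from the device's PTYPE map response; after that every
     * RX descriptor resolves its L3/L4 type with one array access. */
    static struct ptype_entry ptype_map[PTYPE_MAP_SIZE];

    static struct ptype_entry ptype_lookup(uint16_t ptype)
    {
        return ptype_map[ptype & (PTYPE_MAP_SIZE - 1)];
    }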
2021-06-24  gve: adminq: DQO specific device descriptor logic  (Bailey Forrest)

- In addition to TX and RX queues, DQO has TX completion and RX buffer queues:
  - TX completions are received when the device has completed sending a packet on the wire.
  - RX buffers are posted on a separate queue from the RX completions.
- DQO descriptor rings are allowed to be smaller than PAGE_SIZE.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  gve: Introduce per netdev `enum gve_queue_format`  (Bailey Forrest)

The currently supported queue formats are:
- GQI_RDA - GQI with raw DMA addressing
- GQI_QPL - GQI with queue page list
- DQO_RDA - DQO with raw DMA addressing

The old `gve_priv.raw_addressing` value is only used for GQI_RDA, so we remove it in favor of just checking against GQI_RDA.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  gve: Introduce a new model for device options  (Bailey Forrest)

The current model uses an integer ID and a fixed size struct for the parameters of each device option.

The new model allows the device option structs to grow in size over time. A driver may assume that changes to device option structs will always be appended.

New device options will also generally have a `supported_features_mask` so that the driver knows which fields within a particular device option are enabled.

`gve_device_option.feat_mask` is changed to `required_features_mask`, and it is a bitmask which must match the value expected by the driver. This gives the device the ability to break backwards compatibility with old drivers for certain features by blocking the old drivers from trying to use the feature.

We maintain ABI compatibility with the old model for GVE_DEV_OPT_ID_RAW_ADDRESSING in case a driver is using a device which does not support the new model.

This patch introduces some new terminology:

RDA - Raw DMA Addressing - Buffers associated with SKBs are directly DMA mapped and read/updated by the device.

QPL - Queue Page Lists - Driver uses bounce buffers which are DMA mapped with the device for read/write and data is copied from/to SKBs.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
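A minimal sketch of walking such variable-size options, with assumed field names (the real structs live in the gve adminq code):

    #include <stddef.h>
    #include <stdint.h>

    struct dev_option {                   /* stands in for gve_device_option */
        uint16_t option_id;
        uint16_t option_length;           /* bytes of payload that follow */
        uint32_t required_features_mask;  /* must match the driver's value */
        /* payload follows; newer devices may append fields to it */
    };

    static void handle_option(const struct dev_option *opt)
    {
        /* Verify required_features_mask, then read only the payload
         * fields this driver version knows about. */
    }

    static void parse_options(const uint8_t *buf, size_t len)
    {
        size_t off = 0;

        /* Each option advances by header size plus its own declared
         * length, so options can grow without breaking older parsers. */
        while (off + sizeof(struct dev_option) <= len) {
            const struct dev_option *opt =
                (const struct dev_option *)(buf + off);

            handle_option(opt);
            off += sizeof(struct dev_option) + opt->option_length;
        }
    }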
2021-06-24  gve: Make gve_rx_slot_page_info.page_offset an absolute offset  (Bailey Forrest)

Using `page_offset` like a boolean means a page may only be split into two sections. With page sizes larger than 4k, this can be very wasteful.

Future commits in this patchset use `struct gve_rx_slot_page_info` in a way which supports a fixed buffer size and a variable page size.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
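A sketch of the idea with hypothetical numbers: an absolute offset lets one large page carry many fixed-size buffers instead of just two halves.

    #include <stdint.h>

    #define BUF_SIZE 2048              /* fixed RX buffer size (example value) */

    struct page_info {                 /* stands in for gve_rx_slot_page_info */
        void *page_addr;
        uint32_t page_offset;          /* absolute byte offset, not a 0/1 flag */
    };

    /* With a 64k page and 2k buffers, offsets 0, 2048, 4096, ... are all
     * usable slots; a boolean offset would waste everything past 4k. */
    static void *buf_address(const struct page_info *pi)
    {
        return (uint8_t *)pi->page_addr + pi->page_offset;
    }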
2021-06-24  gve: gve_rx_copy: Move padding to an argument  (Bailey Forrest)

Future use cases will have a different padding value.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  gve: Move some static functions to a common file  (Bailey Forrest)

These functions will be shared by the GQI and DQO variants of the GVNIC driver in follow-up patches in this series.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  gve: Update GVE documentation to describe DQO  (Bailey Forrest)

DQO is a new descriptor format for our next generation virtual NIC.

Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  ipv6: fix out-of-bound access in ip6_parse_tlv()  (Eric Dumazet)

The first problem is that optlen is fetched without checking that there is more than one byte to parse. Fix this by taking care of IPV6_TLV_PAD1 before fetching optlen (under appropriate sanity checks against len).

The second problem is that the IPV6_TLV_PADN checks of zero padding are performed before the check of remaining length.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: c1412fce7ecc ("net/ipv6/exthdrs.c: Strict PadN option checking")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
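A minimal sketch of the corrected ordering (not the actual ip6_parse_tlv() body): PAD1, a single-byte option, is handled before any attempt to read a length byte, and the remaining-length check precedes the PadN zero-padding check.

    #include <stddef.h>
    #include <stdint.h>

    #define TLV_PAD1 0
    #define TLV_PADN 1

    /* Returns 0 on success, -1 on malformed options. */
    static int parse_tlv_options(const uint8_t *p, size_t len)
    {
        while (len > 0) {
            if (p[0] == TLV_PAD1) {     /* one byte, no length field */
                p++;
                len--;
                continue;
            }
            if (len < 2)                /* need type + length bytes */
                return -1;
            uint8_t optlen = p[1];
            if (optlen + 2u > len)      /* length check BEFORE payload checks */
                return -1;
            if (p[0] == TLV_PADN) {
                for (size_t i = 0; i < optlen; i++)
                    if (p[2 + i] != 0)
                        return -1;      /* strict zero-padding check */
            }
            p += optlen + 2;
            len -= optlen + 2;
        }
        return 0;
    }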
2021-06-24  Merge branch 'macsec-key-length'  (David S. Miller)

Antoine Tenart says:

====================
net: macsec: fix key length when offloading

The key length used to copy the key to offloading drivers and to store it is wrong and was working by chance, as it matched the default key length. But using a different key length fails. Fix it by instead using the max length accepted in uAPI to store the key, and the actual key length when copying it.

This was tested on the MSCC PHY driver but not on the Atlantic MAC (looking at the code it looks ok, but testing would be appreciated).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: atlantic: fix the macsec key length  (Antoine Tenart)

The key length used to store the macsec key was set to MACSEC_KEYID_LEN (16), which is an issue as:
- This was never meant to be the key length.
- The key length can be > 16.

Fix this by using MACSEC_MAX_KEY_LEN instead (the max length accepted in uAPI).

Fixes: 27736563ce32 ("net: atlantic: MACSec egress offload implementation")
Fixes: 9ff40a751a6f ("net: atlantic: MACSec ingress offload implementation")
Reported-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: phy: mscc: fix macsec key length  (Antoine Tenart)

The key length used to store the macsec key was set to MACSEC_KEYID_LEN (16), which is an issue as:
- This was never meant to be the key length.
- The key length can be > 16.

Fix this by using MACSEC_MAX_KEY_LEN instead (the max length accepted in uAPI).

Fixes: 28c5107aa904 ("net: phy: mscc: macsec support")
Reported-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: macsec: fix the length used to copy the key for offloading  (Antoine Tenart)

The key length used when offloading macsec to Ethernet or PHY drivers was set to MACSEC_KEYID_LEN (16), which is an issue as:
- This was never meant to be the key length.
- The key length can be > 16.

Fix this by using MACSEC_MAX_KEY_LEN to store the key (the max length accepted in uAPI) and secy->key_len to copy it.

Fixes: 3cf3227a21d1 ("net: macsec: hardware offloading infrastructure")
Reported-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
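A sketch of the fixed pattern with a hypothetical SA struct; MACSEC_KEYID_LEN and MACSEC_MAX_KEY_LEN are the uAPI constants the series refers to:

    #include <linux/if_macsec.h>    /* MACSEC_KEYID_LEN, MACSEC_MAX_KEY_LEN */
    #include <string.h>

    struct offload_sa {             /* hypothetical stand-in for the SA state */
        /* Before: unsigned char key[MACSEC_KEYID_LEN]; 16 bytes only held
         * the default-length key by accident. */
        unsigned char key[MACSEC_MAX_KEY_LEN];  /* large enough for any key */
    };

    static void copy_key(struct offload_sa *sa,
                         const unsigned char *key, size_t key_len)
    {
        /* Copy the actual configured length, not a fixed constant. */
        memcpy(sa->key, key, key_len);
    }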
2021-06-24  usbnet: add usbnet_event_names[] for kevent  (Yajun Deng)

Modify the netdev_dbg content from int to char * in usbnet_defer_kevent(); this is more readable.

Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  libceph: set global_id as soon as we get an auth ticket  (Ilya Dryomov)

Commit 61ca49a9105f ("libceph: don't set global_id until we get an auth ticket") delayed the setting of global_id too much. It is set only after all tickets are received, but in pre-nautilus clusters an auth ticket and the service tickets are obtained in separate steps (for a total of three MAuth replies). When the service tickets are requested, global_id is used to build an authorizer; if global_id is still 0 we never get them and fail to establish the session.

Move the setting of global_id into the protocol implementations. This way global_id can be set exactly when an auth ticket is received, no sooner and no later.

Fixes: 61ca49a9105f ("libceph: don't set global_id until we get an auth ticket")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2021-06-24  libceph: don't pass result into ac->ops->handle_reply()  (Ilya Dryomov)

There is no result to pass in the msgr2 case because authentication failures are reported through the auth_bad_method frame, and in the MAuth case an error is returned immediately.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2021-06-24  spi: Fix self assignment issue with ancillary->mode  (Colin Ian King)

There is an assignment of ancillary->mode to itself, which looks dubious since the preceding comment states that the speed and mode are taken over from the SPI main device, indicating that ancillary->mode should be assigned the value of spi->mode. Fix this.

Addresses-Coverity: ("Self assignment")
Fixes: 0c79378c0199 ("spi: add ancillary device support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20210623172300.161484-1-colin.king@canonical.com
Signed-off-by: Mark Brown <broonie@kernel.org>
2021-06-24  RDMA/cma: Protect RMW with qp_mutex  (Håkon Bugge)

The struct rdma_id_private contains three bit-fields, tos_set, timeout_set, and min_rnr_timer_set. These are set by accessor functions without any synchronization. If two or all three accessor functions are invoked in close proximity in time, there will be Read-Modify-Write operations on the same variable from several contexts, and the result will be intermittent.

Fix this by protecting the bit-fields with the qp_mutex in the accessor functions.

The consumer of timeout_set and min_rnr_timer_set is in rdma_init_qp_attr(), which is called with qp_mutex held for connected QPs. Explicit locking is added for the consumers of tos and tos_set.

This commit depends on ("RDMA/cma: Remove unnecessary INIT->INIT transition"), since the call to rdma_init_qp_attr() from cma_init_conn_qp() does not hold the qp_mutex.

Fixes: 2c1619edef61 ("IB/cma: Define option to set ack timeout and pack tos_set")
Fixes: 3aeffc46afde ("IB/cma: Introduce rdma_set_min_rnr_timer()")
Link: https://lore.kernel.org/r/1624369197-24578-3-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
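A userspace analogue of the fix, with hypothetical names: bit-fields that share a storage unit must be written under a common lock, otherwise concurrent read-modify-write cycles lose updates.

    #include <pthread.h>

    struct id_private {                 /* stands in for rdma_id_private */
        pthread_mutex_t qp_mutex;       /* initialize with pthread_mutex_init() */
        unsigned tos_set : 1;           /* these bit-fields share one storage  */
        unsigned timeout_set : 1;       /* unit: unsynchronized writers RMW    */
        unsigned min_rnr_timer_set : 1; /* the same word and clobber each other */
        unsigned char tos;
    };

    static void set_service_type(struct id_private *id, unsigned char tos)
    {
        pthread_mutex_lock(&id->qp_mutex);   /* serialize the RMW */
        id->tos = tos;
        id->tos_set = 1;
        pthread_mutex_unlock(&id->qp_mutex);
    }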
2021-06-24  RDMA/cma: Remove unnecessary INIT->INIT transition  (Håkon Bugge)

In rdma_create_qp(), a connected QP will be transitioned to the INIT state. Afterwards, the QP will be transitioned to the RTR state by the cma_modify_qp_rtr() function. But this function starts by performing an ib_modify_qp() to the INIT state again, before another ib_modify_qp() is performed to transition the QP to the RTR state.

Hence, there is no need to transition the QP to the INIT state in rdma_create_qp().

Link: https://lore.kernel.org/r/1624369197-24578-2-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-24  Merge branch 'add-sparx5i-driver'  (David S. Miller)

Steen Hegelund says:

====================
Adding the Sparx5i Switch Driver

This series provides the Microchip Sparx5i Switch Driver.

The SparX-5 Enterprise Ethernet switch family provides a rich set of Enterprise switching features such as advanced TCAM-based VLAN and QoS processing, enabling delivery of differentiated services and security through TCAM-based frame processing using the versatile content-aware processor (VCAP).

IPv4/IPv6 Layer 3 (L3) unicast and multicast routing is supported with up to 18K IPv4/9K IPv6 unicast LPM entries and up to 9K IPv4/3K IPv6 (S,G) multicast groups. L3 security features include source guard and reverse path forwarding (uRPF) tasks. Additional L3 features include VRF-Lite and IP tunnels (IP over GRE/IP).

The SparX-5 switch family features a highly flexible set of Ethernet ports with support for 10G and 25G aggregation links, QSGMII, USGMII, and USXGMII. The device integrates a powerful 1 GHz dual-core ARM® Cortex®-A53 CPU enabling full management of the switch and advanced Enterprise applications.

The SparX-5 switch family targets managed Layer 2 and Layer 3 equipment in SMB, SME, and Enterprise where high port count 1G/2.5G/5G/10G switching with 10G/25G aggregation links is required.

The SparX-5 switch family consists of the following SKUs:

VSC7546 SparX-5-64 supports up to 64 Gbps of bandwidth with the following primary port configurations:
- 6 × 10G
- 16 × 2.5G + 2 × 10G
- 24 × 1G + 4 × 10G

VSC7549 SparX-5-90 supports up to 90 Gbps of bandwidth with the following primary port configurations:
- 9 × 10G
- 16 × 2.5G + 4 × 10G
- 48 × 1G + 4 × 10G

VSC7552 SparX-5-128 supports up to 128 Gbps of bandwidth with the following primary port configurations:
- 12 × 10G
- 6 × 10G + 2 × 25G
- 16 × 2.5G + 8 × 10G
- 48 × 1G + 8 × 10G

VSC7556 SparX-5-160 supports up to 160 Gbps of bandwidth with the following primary port configurations:
- 16 × 10G
- 10 × 10G + 2 × 25G
- 16 × 2.5G + 10 × 10G
- 48 × 1G + 10 × 10G

VSC7558 SparX-5-200 supports up to 200 Gbps of bandwidth with the following primary port configurations:
- 20 × 10G
- 8 × 25G

In addition, the device supports one 10/100/1000/2500/5000 Mbps SGMII/SerDes node processor interface (NPI) Ethernet port.

Time sensitive networking (TSN) is supported through a comprehensive set of features including frame preemption, cut-through, frame replication and elimination for reliability, enhanced scheduling (credit-based shaping, time-aware shaping, cyclic queuing and forwarding), and per-stream policing and filtering. Together with IEEE 1588 and IEEE 802.1AS support, this guarantees low-latency deterministic networking for Industrial Ethernet.

The Sparx5i support is developed on the PCB134 and PCB135 evaluation boards.

- PCB134 main networking features:
  - 12x SFP+ front 10G module slots (connected to Sparx5i through SFI).
  - 8x SFP28 front 25G module slots (connected to Sparx5i through SFI high speed).
  - Optional, one additional 10/100/1000BASE-T (RJ45) Ethernet port (on-board VSC8211 PHY connected to Sparx5i through SGMII).
- PCB135 main networking features:
  - 48x 1G (10/100/1000M) RJ45 front ports using 12x VSC8514 QuadPHYs, each connected to VSC7558 through QSGMII.
  - 4x 10G (1G/2.5G/5G/10G) RJ45 front ports using the AQR407 10G QuadPHY; each port connects to VSC7558 through SFI.
  - 4x SFP28 25G module slots on the back, connected to VSC7558 through SFI high speed.
  - Optional, one additional 1G (10/100/1000M) RJ45 port using an on-board VSC8211 PHY, which can be connected to the VSC7558 NPI port through SGMII using a loopback add-on PCB.

This series provides support for:
- SFPs and DAC cables via PHYLINK with a number of 5G, 10G and 25G devices and media types.
- Port module configuration for 10M to 25G speeds with SGMII, QSGMII, 1000BASEX, 2500BASEX and 10GBASER as appropriate for these modes.
- SerDes configuration via the Sparx5i SerDes driver (see below).
- Host mode providing register based injection and extraction.
- Switch mode providing MAC/VLAN table learning and Layer 2 switching offloaded to the Sparx5i switch.
- STP state, VLAN support, host/bridge port mode, Forwarding DB, and configuration and statistics via ethtool.

More support will be added at a later stage.

The Sparx5i Chip Register Model can be browsed at this location: https://github.com/microchip-ung/sparx-5_reginfo
and the datasheet is available here: https://ww1.microchip.com/downloads/en/DeviceDoc/SparX-5_Family_L2L3_Enterprise_10G_Ethernet_Switches_Datasheet_00003822B.pdf

The series depends on the following series currently on their way into the kernel:
- 25G Base-R phy mode
  Link: https://lore.kernel.org/r/20210611125453.313308-1-steen.hegelund@microchip.com/
- Sparx5 Reset Driver
  Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/

ChangeLog:
v5:
- cover letter
  - updated the description to match the latest data sheets
- basic driver
  - added error message in case of reset controller error
  - port struct: replacing has_sfp with inband, adding pause_adv
- host mode
  - port cleanup: unregisters netdevs and then removes phylink etc.
  - checking for pause_adv when comparing port config changes
  - getting duplex and pause state in the link_up callback
  - getting inband, autoneg and pause_adv config in the pcs_config callback
- port
  - use only the pause_adv bits when getting aneg status
  - use the inband state when updating the PCS and port config
v4:
- basic driver: using devm_reset_control_get_optional_shared to get the reset control, and letting the reset framework check if it is valid.
- host mode (phylink): use the PCS operations to get state and update configuration. Removed the setting of interface modes; let phylink control this. Using the new 5gbase-r and 25gbase-r modes. Using a helper function to check if one of the 3 base-r modes has been selected. Currently it will not be possible to change the interface mode by changing the speed (e.g. via ethtool). This will be added later.
v3:
- basic driver:
  - removed unneeded braces
  - release reference to ports node after use
  - use dev_err_probe to handle DEFER
  - update error value when bailing out (a few cases)
  - updated formatting of port struct and grouping of bool values
  - simplified the spx5_rmw and spx5_inst_rmw inline functions
- host mode (netdev):
  - removed lockless flag
  - added port timer init
- host mode (packet - manual injection):
  - updated error counters in error situations
  - implemented timer handling of watermark threshold: stop and restart netif queues
  - fixed error message handling (rate limited)
  - fixed comment style error
  - used DIV_ROUND_UP macro
  - removed a debug message for open ports
v2:
- updated bindings:
  - drop minItems for the reg property
- statistics implementation:
  - reorganized statistics into ethtool groups: eth-phy, eth-mac, eth-ctrl, rmon, as defined by the IEEE 802.3 categories and RFC 2819
  - the remaining statistics are provided by the classic ethtool statistics command
- hostmode support:
  - removed netdev renaming
  - validate the ethernet address in sparx5_set_mac_address()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  arm64: dts: sparx5: Add the Sparx5 switch node  (Steen Hegelund)

This provides the configuration for the currently available evaluation boards PCB134 and PCB135.

The series depends on the following series currently on its way into the kernel:
- Sparx5 Reset Driver
  Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: sparx5: add ethtool configuration and statistics support  (Steen Hegelund)

This adds statistic counters for the network interfaces provided by the driver. It also adds CPU port counters (which are not exposed by ethtool).

This also adds support for configuring the network interface parameters via ethtool: speed, duplex, aneg etc.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: sparx5: add calendar bandwidth allocation support  (Steen Hegelund)

This configures the Sparx5 calendars according to the bandwidth requested in the Device Tree nodes. It also checks if the total requested bandwidth is within the limits of the detected Sparx5 model.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: sparx5: add switching support  (Steen Hegelund)

This adds SwitchDev support by offloading the software bridge to hardware.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: sparx5: add vlan support  (Steen Hegelund)

This adds Sparx5 VLAN support.

Sparx5 has more VLAN features than provided here, but these will be added in later series. For now we only add the basic L2 features.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: sparx5: add mactable support  (Steen Hegelund)

This adds the Sparx5 MAC tables: listening for MAC table updates and updating on request.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: sparx5: add port module support  (Steen Hegelund)

This adds configuration of the Sparx5 port module instances.

Sparx5 has in total 65 logical ports (denoted D0 to D64) and 33 physical SerDes connections (S0 to S32). The 65th port (D64) is fixed-allocated to SerDes0 (S0). The remaining 64 ports can in various multiplexing scenarios be connected to the remaining 32 SerDes using QSGMII, USGMII, or USXGMII extenders. 32 of the ports can have a 1:1 mapping to the 32 SerDes.

Some additional ports (D65 to D69) are internal to the device and do not connect to port modules or SerDes macros. For example, internal ports are used for frame injection and extraction to the CPU queues.

The 65 logical ports are split up into the following blocks:
- 13 x 5G ports (D0-D11, D64)
- 32 x 2G5 ports (D16-D47)
- 12 x 10G ports (D12-D15, D48-D55)
- 8 x 25G ports (D56-D63)

Each logical port supports different line speeds, and depending on the speeds supported, different port modules (MAC+PCS) are needed. A port supporting 5 Gbps, 10 Gbps, or 25 Gbps as maximum line speed will have a DEV5G, DEV10G, or DEV25G module to support the 5 Gbps, 10 Gbps (incl. 5 Gbps), or 25 Gbps (including 10 Gbps and 5 Gbps) speeds. In addition, it will have a shadow DEV2G5 port module to support the lower speeds (10/100/1000/2500 Mbps). When a port needs to operate at a lower speed, the shadow DEV2G5 is connected to its corresponding SerDes.

Not all interface modes are supported in this series, but they will be added at a later stage.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
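A sketch, under the block layout listed above, of resolving a logical port number to the port module type carrying its maximum speed (illustrative only; valid for D0-D64):

    enum dev_type { DEV2G5, DEV5G, DEV10G, DEV25G };

    static enum dev_type port_max_dev(int portno)
    {
        if (portno <= 11 || portno == 64)
            return DEV5G;       /* D0-D11, D64 */
        if (portno >= 16 && portno <= 47)
            return DEV2G5;      /* D16-D47 */
        if ((portno >= 12 && portno <= 15) ||
            (portno >= 48 && portno <= 55))
            return DEV10G;      /* D12-D15, D48-D55 */
        return DEV25G;          /* D56-D63 */
    }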
2021-06-24  net: sparx5: add hostmode with phylink support  (Steen Hegelund)

This patch adds netdevs and phylink support for the ports in the switch. It also adds register based injection and extraction for these ports.

Frame DMA support for injection and extraction will be added in a later series.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: sparx5: add the basic sparx5 driver  (Steen Hegelund)

This adds the Sparx5 basic SwitchDev driver framework with IO range mapping, switch device detection and core clock configuration.

Support for ports, phylink, netdev, mactable etc. is added in the following patches.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  dt-bindings: net: sparx5: Add sparx5-switch bindings  (Steen Hegelund)

Document the Sparx5 switch device driver bindings.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  Merge tag 'linux-can-fixes-for-5.13-20210624' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can  (David S. Miller)

Marc Kleine-Budde says:

====================
pull-request: can 2021-06-24

This is a pull request of 2 patches for net/master.

The first patch is by Norbert Slusarek and prevents allocation of a filter for optlen == 0 in the j1939 CAN protocol.

The last patch is by Stephane Grosjean and fixes a potential starvation in the TX path of the peak_pciefd driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  Merge branch 'ibmvnic-fixes'  (David S. Miller)

Sukadev Bhattiprolu says:

====================
ibmvnic: Assorted bug fixes

Assorted bug fixes that we tested over the last several weeks. Thanks to Brian King, Cris Forno, Dany Madden and Rick Lindsley for reviews and help with testing.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  ibmvnic: parenthesize a check  (Sukadev Bhattiprolu)

Parenthesize a check to be more explicit and to fix a sparse warning seen on some distros.

Fixes: 91dc5d2553fbf ("ibmvnic: fix miscellaneous checks")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  ibmvnic: free tx_pool if tso_pool alloc fails  (Sukadev Bhattiprolu)

Free tx_pool and clear it if allocation of tso_pool fails.

release_tx_pools() assumes we have both tx and tso_pools if ->tx_pool is non-NULL. If allocation of tso_pool fails in init_tx_pools(), the assumption will not be true and we would end up dereferencing ->tx_buff, ->free_map fields from a NULL pointer.

Fixes: 3205306c6b8d ("ibmvnic: Update TX pool initialization routine")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
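A userspace-style sketch of the fix, with hypothetical helpers: keep the invariant that tso_pool exists whenever ->tx_pool is non-NULL.

    #include <stdlib.h>

    struct adapter {
        void *tx_pool;
        void *tso_pool;
    };

    /* If the second allocation fails, undo the first and clear the
     * pointer so release_tx_pools() never sees a half-built state. */
    static int init_tx_pools(struct adapter *a, size_t n)
    {
        a->tx_pool = calloc(n, 64);
        if (!a->tx_pool)
            return -1;

        a->tso_pool = calloc(n, 64);
        if (!a->tso_pool) {
            free(a->tx_pool);
            a->tx_pool = NULL;      /* don't leave a dangling pointer */
            return -1;
        }
        return 0;
    }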
2021-06-24  ibmvnic: set ltb->buff to NULL after freeing  (Sukadev Bhattiprolu)

free_long_term_buff() checks ltb->buff to decide whether we have a long term buffer to free. So set ltb->buff to NULL after freeing. While here, also clear ->map_id, fix up some coding style and log an error.

Fixes: 9c4eaabd1bb39 ("Check CRQ command return codes")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
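A minimal sketch of the pattern, with a hypothetical stand-in struct:

    #include <stdlib.h>

    struct long_term_buff {     /* hypothetical stand-in for the ltb */
        void *buff;
        int map_id;
    };

    /* The free routine uses ->buff as its "is allocated" flag, so it
     * must be cleared after freeing to keep the function idempotent. */
    static void free_long_term_buff(struct long_term_buff *ltb)
    {
        if (!ltb->buff)
            return;             /* nothing to free */
        free(ltb->buff);
        ltb->buff = NULL;       /* a second call is now a harmless no-op */
        ltb->map_id = 0;
    }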
2021-06-24  ibmvnic: account for bufs already saved in indir_buf  (Sukadev Bhattiprolu)

This fixes a crash in replenish_rx_pool() when called from ibmvnic_poll() after a previous call to replenish_rx_pool() encountered an error when allocating a socket buffer.

Thanks to Rick Lindsley and Dany Madden for helping debug the crash.

Fixes: 4f0b6812e9b9 ("ibmvnic: Introduce batched RX buffer descriptor transmission")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  ibmvnic: clean pending indirect buffs during reset  (Sukadev Bhattiprolu)

We batch subordinate command response queue (scrq) descriptors that we need to send to the VIOS using an "indirect" buffer. If, after we queue one or more scrq descriptors in the indirect buffer, we encounter an error (say we fail to allocate an skb), we leave the queued scrq descriptors in the indirect buffer until the next call to ibmvnic_xmit().

On the next call to ibmvnic_xmit(), it is possible that the adapter is going through a reset, and it is possible that the long term buffers have been unmapped on the VIOS side. If we proceed to flush (send) the packets that are in the indirect buffer, we will end up using the old map ids and this can cause the VIOS to trigger an unnecessary FATAL error reset.

Instead of flushing packets remaining in the indirect_buff, discard (clean) them.

Fixes: 0d973388185d4 ("ibmvnic: Introduce xmit_more support using batched subCRQ hcalls")
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24Revert "ibmvnic: remove duplicate napi_schedule call in open function"Dany Madden
This reverts commit 7c451f3ef676c805a4b77a743a01a5c21a250a73. When a vnic interface is taken down and then up, connectivity is not restored. We bisected it to this commit. Reverting this commit until we can fully investigate the issue/benefit of the change. Fixes: 7c451f3ef676 ("ibmvnic: remove duplicate napi_schedule call in open function") Reported-by: Cristobal Forno <cforno12@linux.ibm.com> Reported-by: Abdul Haleem <abdhalee@in.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24Revert "ibmvnic: simplify reset_long_term_buff function"Sukadev Bhattiprolu
This reverts commit 1c7d45e7b2c29080bf6c8cd0e213cc3cbb62a054. We tried to optimize the number of hcalls we send and skipped sending the REQUEST_MAP calls for some maps. However during resets, we need to resend all the maps to the VIOS since the VIOS does not remember the old values. In fact we may have failed over to a new VIOS which will not have any of the mappings. When we send packets with map ids the VIOS does not know about, it triggers a FATAL reset. While the client does recover from the FATAL error reset, we are seeing a large number of such resets. Handling FATAL resets is lot more unnecessary work than issuing a few more hcalls so revert the commit and resend the maps to the VIOS. Fixes: 1c7d45e7b2c ("ibmvnic: simplify reset_long_term_buff function") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: mdiobus: fix fwnode_mdiobus_register() fallback case  (Marcin Wojtas)

The fallback case of fwnode_mdiobus_register() (relevant for !CONFIG_FWNODE_MDIO) was defined with a wrong argument name, causing a compilation error. Fix that.

Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: ip: avoid OOM kills with large UDP sends over loopback  (Jakub Kicinski)

Dave observed a number of machines hitting OOM on the UDP send path. The workload seems to be sending large UDP packets over loopback. Since loopback has an MTU of 64k, the kernel will try to allocate an skb with up to 64k of head space. This has a good chance of failing under memory pressure. What's worse, if the message length is <32k, the allocation may trigger an OOM killer.

This is entirely avoidable; we can use an skb with page frags.

af_unix solves a similar problem by limiting the head length to SKB_MAX_ALLOC. This seems like a good and simple approach. It means that UDP messages > 16kB will now use fragments if the underlying device supports SG. If extra allocator pressure causes regressions in real workloads, we can switch to trying the large allocation first and falling back.

v4: pre-calculate all the additions to alloclen so we can be sure it won't go over order-2

Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
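A sketch of the head-length cap described above; SKB_MAX_ALLOC is the kernel's constant (roughly 16kB), approximated here:

    #include <stddef.h>

    /* Approximation of the kernel's SKB_MAX_ALLOC for illustration. */
    #define SKB_MAX_ALLOC (4096 << 2)

    /* Cap the linear (head) part of the skb; the rest of the message
     * goes into page fragments when the device supports SG. af_unix
     * uses the same trick. */
    static size_t linear_alloc_len(size_t msg_len, int dev_supports_sg)
    {
        if (!dev_supports_sg)
            return msg_len;     /* no SG: the skb must stay linear */
        return msg_len < SKB_MAX_ALLOC ? msg_len : SKB_MAX_ALLOC;
    }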
2021-06-24  RDMA/hns: Add window selection field of congestion control  (Yixing Liu)

The window selection field is necessary for congestion control of HIP09; it is obtained from the firmware and then filled into the QPC. Some algorithms need it to decide whether to limit the number of windows.

Fixes: f91696f2f053 ("RDMA/hns: Support congestion control type selection according to the FW")
Link: https://lore.kernel.org/r/1624364163-44185-1-git-send-email-liweihang@huawei.com
Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-24  tools/testing: add a selftest for SO_NETNS_COOKIE  (Lorenz Bauer)

Make sure that SO_NETNS_COOKIE returns a non-zero value, and that sockets from different namespaces have a distinct cookie value.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24  net: retrieve netns cookie via getsocketopt  (Martynas Pumputis)

It's getting more common to run nested container environments for testing cloud software. One such example is Kind [1], which runs a Kubernetes cluster in Docker containers on a single host. Each container acts as a Kubernetes node, and thus can run any Pod (aka container) inside the former. This approach simplifies testing a lot, as it eliminates complicated VM setups.

Unfortunately, such a setup breaks some functionality when cgroupv2 BPF programs are used for load-balancing. The load-balancer BPF program needs to detect whether a request originates from the host netns or a container netns in order to allow some access, e.g. to a service via a loopback IP address. Typically, the programs detect this by comparing netns cookies with the one of the init ns via a call to bpf_get_netns_cookie(NULL). However, in nested environments the latter cannot be used given the Kubernetes node's netns is outside the init ns. To fix this, we need to pass the Kubernetes node netns cookie to the program in a different way: by extending getsockopt() with a SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes node netns can retrieve the cookie and pass it to the program instead.

Thus, this is a follow-up on Eric's commit 3d368ab87cf6 ("net: initialize net->net_cookie at netns setup") to allow retrieval via SO_NETNS_COOKIE. This is also in line with how we retrieve the socket cookie via SO_COOKIE.

[1] https://kind.sigs.k8s.io/

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
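A minimal userspace example of the new option; the SO_NETNS_COOKIE value below follows the generic uAPI socket numbering (71), defined here only as a fallback for older headers:

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef SO_NETNS_COOKIE
    #define SO_NETNS_COOKIE 71
    #endif

    int main(void)
    {
        uint64_t cookie;
        socklen_t len = sizeof(cookie);
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
            return 1;
        /* Older kernels return ENOPROTOOPT here. */
        if (getsockopt(fd, SOL_SOCKET, SO_NETNS_COOKIE, &cookie, &len))
            return 1;
        printf("netns cookie: %llu\n", (unsigned long long)cookie);
        close(fd);
        return 0;
    }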
2021-06-24  RDMA/hfi1: Remove use of kmap()  (Ira Weiny)

kmap() is being deprecated and will break uses of device dax after PKS protection is introduced. [1]

The kmap() used in sdma does not need to be global. Use the new kmap_local_page(), which works with PKS and may provide better performance for this thread-local mapping anyway.

[1] https://lore.kernel.org/lkml/20201009195033.3208459-59-ira.weiny@intel.com/

Link: https://lore.kernel.org/r/20210622061422.2633501-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
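A sketch of the conversion pattern (kernel API; kmap_local_page()/kunmap_local() are the real interfaces, the helper below is illustrative):

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* kmap_local_page() creates a short-lived, CPU-local mapping; it must
     * be released in the same thread, in reverse order of acquisition. */
    static void copy_from_page(void *dst, struct page *page,
                               size_t offset, size_t len)
    {
        void *addr = kmap_local_page(page);   /* was: kmap(page) */

        memcpy(dst, addr + offset, len);
        kunmap_local(addr);                   /* was: kunmap(page) */
    }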
2021-06-24  RDMA/irdma: Remove use of kmap()  (Ira Weiny)

kmap() is being deprecated and will break uses of device dax after PKS protection is introduced. [1]

The kmap() used in the irdma CM driver is thread-local. Therefore kmap_local_page() is sufficient to use and may provide performance benefits as well. kmap_local_page() will work with device dax and pgmap protected pages.

Use kmap_local_page() instead of kmap().

[1] https://lore.kernel.org/lkml/20201009195033.3208459-59-ira.weiny@intel.com/

Link: https://lore.kernel.org/r/20210622165622.2638628-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-24  i40e: Fix missing rtnl locking when setting up pf switch  (Jan Sokolowski)

A recent change that made i40e use the new udp_tunnel infrastructure uses a method that expects to be called under the rtnl lock. However, not all codepaths take the lock prior to calling i40e_setup_pf_switch(). Fix that by adding the missing rtnl locking and unlocking.

Fixes: 40a98cb6f01f ("i40e: convert to new udp_tunnel infrastructure")
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
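A sketch of the fix pattern; the callsite and the exact i40e_setup_pf_switch() signature are assumptions here:

    #include <linux/rtnetlink.h>

    /* Callers that do not already hold RTNL must take it around the
     * switch setup, since the udp_tunnel infrastructure expects it. */
    static int example_rebuild(struct i40e_pf *pf)
    {
        int err;

        rtnl_lock();
        err = i40e_setup_pf_switch(pf, false);  /* assumed signature */
        rtnl_unlock();

        return err;
    }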