summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
AgeCommit message (Collapse)Author
2023-01-27net/mlx5: Move eswitch port metadata devlink param to flow eswitch codeJiri Pirko
Move the param registration and handling code into the eswitch offloads code as they are related to each other. No point in having the devlink param registration done in separate file. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-12-28net/mlx5: E-Switch, properly handle ingress tagged packets on VSTMoshe Shemesh
Fix SRIOV VST mode behavior to insert cvlan when a guest tag is already present in the frame. Previous VST mode behavior was to drop packets or override existing tag, depending on the device version. In this patch we fix this behavior by correctly building the HW steering rule with a push vlan action, or for older devices we ask the FW to stack the vlan when a vlan is already present. Fixes: 07bab9502641 ("net/mlx5: E-Switch, Refactor eswitch ingress acl codes") Fixes: dfcb1ed3c331 ("net/mlx5: E-Switch, Vport ingress/egress ACLs rules for VST mode") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-07net/mlx5: E-Switch, Implement devlink port function cmds to control migratableShay Drory
Implement devlink port function commands to enable / disable migratable. This is used to control the migratable capability of the device. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-07net/mlx5: E-Switch, Implement devlink port function cmds to control RoCEYishai Hadas
Implement devlink port function commands to enable / disable RoCE. This is used to control the RoCE device capabilities. This patch implement infrastructure which will be used by downstream patches that will add additional capabilities. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Acked-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-29net/mlx5e: TC, Add offload support for trap with additional actionsMaor Dickman
TC trap action offload is currently supported only when trap is the sole action in the flow. This patch remove this limitation by changing trap action offload to not use MLX5_ATTR_FLAG_SLOW_PATH flag and instead set the flow destination table explicitly to be the slow table. This will allow offload of the additional actions. TC flow example: tc filter add dev $REP2 protocol ip prio 2 root \ flower skip_sw dst_mac $mac0 \ action mirred egress redirect dev $REP3 \ action pedit ex munge eth dst set $mac2 pipe \ action trap Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-24net/mlx5: E-switch, Fix duplicate lag creationChris Mi
If creating bond first and then enabling sriov in switchdev mode, will hit the following syndrome: mlx5_core 0000:08:00.0: mlx5_cmd_out_err:778:(pid 25543): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb), err(-22) The reason is because the offending patch removes eswitch mode none. In vf lag, the checking of eswitch mode none is replaced by checking if sriov is enabled. But when driver enables sriov, it triggers the bond workqueue task first and then setting sriov number in pci_enable_sriov(). So the check fails. Fix it by checking if sriov is enabled using eswitch internal counter that is set before triggering the bond workqueue task. Fixes: f019679ea5f2 ("net/mlx5: E-switch, Remove dependency between sriov and eswitch mode") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-08-22net/mlx5: E-Switch, Move send to vport meta rule creationRoi Dayan
Move the creation of the rules from offloads fdb table init to per rep vport init. This way the driver will creating the send to vport meta rule on any representor, e.g. SF representors. Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-08-22net/mlx5: E-Switch, Add default drop rule for unmatched packetsJianbo Liu
The ft_offloads table serves to steer packets, which are from the eswitch, to the representor associated with the packets' source vport. Previously, if a packet's source vport or metadata was not associated with any representor, it was forwarded to the uplink representor. The representor got packets it shouldn't have as they weren't coming from the uplink vport. One such effect of this breakage can be observed if the uplink representor is attached to a bridge, where such illegal packets will be broadcast to the remaining ports, flooding the switch with illegal packets. In the case where IB loopback (e.g, SNAP) is enabled, all transmitted packets would be looped back, and received by the uplink representor, and result in an infinite feedback loop. Therefore, block this hole by adding a default drop rule to the ft_offloads table, so that all unmatched packets with no associated representor are dropped. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Gavi Teitz <gavi@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-13net/mlx5: Expose vnic diagnostic counters for eswitch managed vportsMichael Guralnik
Expose on vport group managers debug counters for their managed vports. Counters are exposed through debugfs, the directory will be present only for functions that are eswitch managers and only counters that are supported on their specific HW/FW will be exposed. Example: $ ls /sys/kernel/debug/mlx5/0000:08:00.0/esw/ pf sf_8 vf_0 vf_1 $ ls -l /sys/kernel/debug/mlx5/0000:08:00.0/esw/vf_0/vnic_diag/ cq_overrun quota_exceeded_command total_q_under_processor_handle invalid_command send_queue_priority_update_flow List of all counter added: total_q_under_processor_handle - number of queues in error state due to an async error or errored command. send_queue_priority_update_flow - number of QP/SQ priority/SL update events. cq_overrun - number of times CQ entered an error state due to an overflow. async_eq_overrun -number of time an EQ mapped to async events was overrun. comp_eq_overrun - number of time an EQ mapped to completion events was overrun. quota_exceeded_command - number of commands issued and failed due to quota exceeded. invalid_command - number of commands issued and failed dues to any reason other than quota exceeded. Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-02net/mlx5: E-switch: Change eswitch mode only via devlink commandChris Mi
Enable or disable switchdev according to the eswitch mode set by devlink command. So it is not changed by other functions anymore. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-02net/mlx5: E-switch, Remove dependency between sriov and eswitch modeChris Mi
Currently, there are three eswitch modes, none, legacy and switchdev. None is the default mode. Remove redundant none mode as eswitch mode should always be either legacy mode or switchdev mode. With this patch, there are two behavior changes: 1. Legacy becomes the default mode. When querying eswitch mode using devlink, a valid mode is always returned. 2. When disabling sriov, the eswitch mode will not change, only vfs are unloaded. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-02net/mlx5: E-switch, Introduce flag to indicate if fdb table is createdChris Mi
Introduce flag to indicate if fdb table is created as a pre-step to prepare for removing dependency between sriov and eswitch mode in the downstream patches. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-02net/mlx5: E-switch, Introduce flag to indicate if vport acl namespace is createdChris Mi
Eswitch vport acl namespace is needed when loading vfs. There is no need to free and reallocate it when switching eswitch mode. Introduce flag to indicate if it is created or not. When needed, create it. Only free it when the driver is unloaded or in bare metal mode. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-05-09net/mlx5: Lag, use lag lockMark Bloch
Use a lag specific lock instead of depending on external locks to synchronise the lag creation/destruction. With this, taking E-Switch mode lock is no longer needed for syncing lag logic. Cleanup any dead code that is left over and don't export functions that aren't used outside the E-Switch core code. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-05-09net/mlx5: Lag, move E-Switch prerequisite check into lag codeMark Bloch
There is no need to expose E-Switch function for something that can be checked with already present API inside lag code. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-03-16net/mlx5e: MPLSoUDP decap, use vlan push_eth instead of peditMaor Dickman
Currently action pedit of source and destination MACs is used to fill the MACs in L2 push step in MPLSoUDP decap offload, this isn't aligned to tc SW which use vlan eth_push action to do this. To fix that, offload support for vlan veth_push action is added together with mpls pop action, and deprecate the use of pedit of MACs. Flow example: filter protocol mpls_uc pref 1 flower chain 0 filter protocol mpls_uc pref 1 flower chain 0 handle 0x1 eth_type 8847 mpls_label 555 enc_dst_port 6635 in_hw in_hw_count 1 action order 1: tunnel_key unset pipe index 2 ref 1 bind 1 used_hw_stats delayed action order 2: mpls pop protocol ip pipe index 2 ref 1 bind 1 used_hw_stats delayed action order 3: vlan push_eth dst_mac de:a2:ec:d6:69:c8 src_mac de:a2:ec:d6:69:c8 pipe index 2 ref 1 bind 1 used_hw_stats delayed action order 4: mirred (Egress Redirect to device enp8s0f0_0) stolen index 2 ref 1 bind 1 used_hw_stats delayed Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-28Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== mlx5-next 2022-22-02 The following PR includes updates to mlx5-next branch: Headlines: ========== 1) Jakub cleans up unused static inline functions 2) I did some low level firmware command interface return status changes to provide the caller with full visibility on the error/status returned by the Firmware. 3) Use the new command interface in RDMA DEVX usecases to avoid flooding dmesg with some "expected" user error prone use cases. 4) Moshe also uses the new command interface to grab the specific error code from MFRL register command to provide the exact error reason for why SW reset couldn't perform internally in FW. 5) From Mark Bloch: Lag, drop packets in hardware when possible In active-backup mode the inactive interface's packets are dropped by the bond device. In switchdev where TC rules are offloaded to the FDB this can lead to packets being hit in the FDB where without offload they would have been dropped before reaching TC rules in the kernel. Create a drop rule to make sure packets on inactive ports are dropped before reaching the FDB. Listen on NETDEV_CHANGEUPPER / NETDEV_CHANGEINFODATA events and record the inactive state and offload accordingly. * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Add clarification on sync reset failure net/mlx5: Add reset_state field to MFRL register RDMA/mlx5: Use new command interface API net/mlx5: cmdif, Refactor error handling and reporting of async commands net/mlx5: Use mlx5_cmd_do() in core create_{cq,dct} net/mlx5: cmdif, Add new api for command execution net/mlx5: cmdif, cmd_check refactoring net/mlx5: cmdif, Return value improvements net/mlx5: Lag, offload active-backup drops to hardware net/mlx5: Lag, record inactive state of bond device net/mlx5: Lag, don't use magic numbers for ports net/mlx5: Lag, use local variable already defined to access E-Switch net/mlx5: E-switch, add drop rule support to ingress ACL net/mlx5: E-switch, remove special uplink ingress ACL handling net/mlx5: E-Switch, reserve and use same uplink metadata across ports net/mlx5: Add ability to insert to specific flow group mlx5: remove unused static inlines ==================== Link: https://lore.kernel.org/r/20220223233930.319301-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-02-23net/mlx5: E-switch, add drop rule support to ingress ACLMark Bloch
Support inserting an ingress ACL drop rule on the uplink in switchdev mode. This will be used by downstream patches to offload active-backup lag mode. The drop rule (if created) is the first rule in the ACL. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-27net/mlx5e: Refactor eswitch attr flags to just attr flagsRoi Dayan
The flags are flow attrs and not esw specific attr flags. Refactor to remove the esw prefix and move from eswitch.h to en_tc.h where struct mlx5_flow_attr exists. Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-21net/mlx5: Remove the repeated declarationShaokun Zhang
Function 'mlx5_esw_vport_match_metadata_supported' and 'mlx5_esw_offloads_vport_metadata_set' are declared twice, so remove the repeated declaration and blank line. Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-11-16net/mlx5: E-switch, Create QoS on demandDmytro Linkin
Don't create eswitch QoS (root TSAR) on switch mode change. Create it on first child TSAR object creation - vport or rate group. Keep track root TSAR references and release root TSAR with last object deletion. No need to check for QoS is enabled when installing tc matchall filter. Remove related helper function due to no users of it. Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29net/mlx5e: Add indirect tc offload of ovs internal portAriel Levkovich
Register callbacks for tc blocks of ovs internal port devices. This allows an indirect offloading rules that apply on such devices as the filter device. In case a rule is added to a tc block of an internal port, the mlx5 driver will implicitly add a matching on the internal port's unique vport metadata value to the rule's matching list. Therefore, only packets that previously hit a rule that redirects to an internal port and got the vport metadata overwritten to the internal port's unique metadata, can match on such indirect rule. Offloading of both ingress and egress tc blocks of internal ports is supported as opposed to other devices where only ingress block offloading is supported. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29net/mlx5e: Offload tc rules that redirect to ovs internal portAriel Levkovich
Allow offloading rules that redirect to ovs internal port ingress and egress. To support redirect to ingress device, offloading of REDIRECT_INGRESS action is added. When a tc rule redirects to ovs internal port, the hw rule will overwrite the input vport value in reg_c0 with a new vport metadata value that is mapped for this internal port using the internal port mapping api that is introduce in previous patches. After that the hw rule will redirect the packet to the root table to continue processing with the new vport metadata value. The new vport metadata value indicates that this packet is now arriving through an internal port and therefore should be processed using rules that apply on the same internal port as the filter device. Therefore, following rules that apply on this internal port will have to match on the same vport metadata value as part of their matching keys to make sure the packet belongs to the internal port. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-29net/mlx5: E-Switch, Add ovs internal port mapping to metadata supportAriel Levkovich
Adding infrastructure to map ovs internal port device to vport match metadata to support offload of rules with internal port as the filter device or as the destination device. The infrastructure allows adding and removing internal port device to an eswitch database and getting a unique vport metadata value to be placed and match on in reg_c0 when offloading rules that are coming from or going to an internal port. The new int port metadata can be written to the source port register in HW to indicate that current source port of the packet is the internal port and not one of the actual HW vports (uplink or VF). Using this method, it is possible to offload TC rules with an OVS internal port as their destination port (overwriting the src vport register) or as the filter port (matching on the value of the src vport register and making sure it matches to the internal port's value). There is also a need to handle a miss case where the packet's src port value was changed in HW to an internal port but a following rule which matches on this new src port value wasn't found in HW. In such case, the packet will be forwarded to the driver with metadata which allows driver to restore the info of the internal port's netdevice. Once this info is restored, the uplink driver can forward the packet to the relevant netdevice in SW. Signed-off-by: Ariel Levkovich <lariel@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-18net/mlx5: E-Switch, Increase supported number of forward destinations to 32Maor Dickman
Increase supported number of forward destinations in the same rule, local and remote, from 2 to 32. Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-04net/mlx5e: Support accept actionVlad Buslov
Support TC generic 'accept' action in mlx5 by introducing MLX5_ESW_ATTR_FLAG_ACCEPT attribute flag. Flag has similar semantics to existing MLX5_ESW_ATTR_FLAG_SLOW_PATH flag, however, dedicated flag is required because existing 'slow path' flag can be flipped by tunneling subsystem when neighbor changes state. Introduce new helper function mlx5_esw_attr_flags_skip() to check whether attribute flags for 'slow path' or 'accept' action are set and use it in eswitch code instead of direct bit manipulation. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-19net/mlx5: E-switch, Allow to add vports to rate groupsDmytro Linkin
Implement eswitch API that allows updating rate groups. If group pointer is NULL, then move the vport to internal unlimited group zero. Implement devlink_ops->rate_parent_node_set() callback in the terms of the new eswitch group update API. Enable QoS for all group's elements if a group has allocated BW share. Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-19net/mlx5: E-switch, Allow setting share/max tx rate limits of rate groupsDmytro Linkin
Provide eswitch API to allow controlling group rate limits. Use it to implement devlink_ops->mlx5_devlink_rate_node_tx_{share|max}_set(). The share rate will create relative bandwidth share on the groups level while within the group the user can set shared rate on the member vports of that group and this rate will be relative to the group's share rate. The group with the highest shared rate will get a BW share of 100 and the rest of the groups will get a value that reflects the ratio between their share rate and the maximum share rate. Example: Created four rate groups with tx_share limits: $ devlink port function rate add \ pci/0000:06:00.0/group_1 tx_share 30gbit $ devlink port function rate add \ pci/0000:06:00.0/group_2 tx_share 20gbit $ devlink port function rate add \ pci/0000:06:00.0/group_3 tx_share 20gbit $ devlink port function rate add \ pci/0000:06:00.0/group_4 tx_share 10gbit Assuming link speed is 50 Gbit/sec ratio divider will be 50 / (30+20+20+10) = 0.625. Normalized rate values for the groups: <group_1> 30 * 0.625 = 18.75 Gbit/sec <group_2> 20 * 0.625 = 12.5 Gbit/sec <group_3> 20 * 0.625 = 12.5 Gbit/sec <group_4> 10 * 0.625 = 6.25 Gbit/sec Rate group with unlimited tx_share rate will receive minimum BW value (1Mbit/sec) if presented any group with tx_share rate limit. This allow to not drop all packets in case of heavy traffic. Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-19net/mlx5: E-switch, Introduce rate limiting groups APIDmytro Linkin
Extend eswitch API with rate limiting groups: - Define new struct mlx5_esw_rate_group that is used to hold all internal group data. - Implement functions that allow creation, destruction and cleanup of groups. - Assign all vports to internal unlimited zero group by default. This commit lays the groundwork for group rate limiting by implementing devlink_ops->rate_node_{new|del}() callbacks to support creating and deleting groups through devlink rate node objects. APIs that allows setting rates and adding/removing members are implemented in following patches. Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-19net/mlx5: E-switch, Move QoS related code to dedicated fileDmytro Linkin
Move eswitch QoS related code into dedicated file. Provide eswitch API to access this code meaning it is isolated and restricted to be used only by eswitch.c. Exception is legacy NDO vf set rate, which moved to esw/legacy.c. Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Huy Nguyen <huyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-19net/mlx5e: TC, Restore tunnel info for sample offloadChris Mi
Currently the sample offload actions send the encapsulated packet to software. sFlow expects tunneled packets to be decapsulated while having the tunnel properties on the skb metadata fields. Reuse the functions used by connection tracking to map the outer header properties to a unique id. The next patch will use that id to restore the tunnel information of decapsulated packets onto the skb. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-19net/mlx5e: Move sample attribute to flow attributeChris Mi
Currently it is in eswitch attribute. Move it to flow attribute to reflect the change in previous patch. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-19net/mlx5e: Move esw/sample to en/tc/sampleChris Mi
Module sample belongs to en/tc instead of esw. Move it and rename accordingly. Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-16net/mlx5: Bridge, identify port by vport_num+esw_owner_vhca_id pairVlad Buslov
Following patches in series allow traffic between vports of different eswitch instances, which requires addressing bridge port by vport_num+esw_owner_vhca_id pair since vport_num is only unique per-eswitch. As a preparation, extend struct mlx5_esw_bridge_port with 'esw_owner_vhca_id' field and use it as part of key for mlx5_esw_bridge->vports xarray. With this change we can't rely on switchdev_handle_port_obj_add() helper to get mlx5 representor from stacked device because we need specifically representor from parent eswitch that registered the callback to obtain correct esw_owner_vhca_id. The helper doesn't allow passing additional parameters to predicate function and doesn't provide access to the notifier block to obtain eswitch through br_offloads. Implement custom helpers to obtain mlx5 representor and use them in mlx5_esw_bridge_port_obj_{add|del|attr_set}() implementations. Remove direct pointer to parent bridge from struct mlx5_vport as it is no longer needed. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-10Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Saeed Mahameed says: ==================== pull-request: mlx5-next 2020-08-9 This pulls mlx5-next branch which includes patches already reviewed on net-next and rdma mailing lists. 1) mlx5 single E-Switch FDB for lag 2) IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq 3) Add DCS caps & fields support [1] https://patchwork.kernel.org/project/netdevbpf/cover/20210803231959.26513-1-saeed@kernel.org/ [2] https://patchwork.kernel.org/project/netdevbpf/patch/0e3364dab7e0e4eea5423878b01aa42470be8d36.1626609184.git.leonro@nvidia.com/ [3] https://patchwork.kernel.org/project/netdevbpf/patch/55e1d69bef1fbfa5cf195c0bfcbe35c8019de35e.1624258894.git.leonro@nvidia.com/ * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Lag, Create shared FDB when in switchdev mode net/mlx5: E-Switch, add logic to enable shared FDB net/mlx5: Lag, move lag destruction to a workqueue net/mlx5: Lag, properly lock eswitch if needed net/mlx5: Add send to vport rules on paired device net/mlx5: E-Switch, Add event callback for representors net/mlx5e: Use shared mappings for restoring from metadata net/mlx5e: Add an option to create a shared mapping net/mlx5: E-Switch, set flow source for send to uplink rule RDMA/mlx5: Add shared FDB support {net, RDMA}/mlx5: Extend send to vport rules RDMA/mlx5: Fill port info based on the relevant eswitch net/mlx5: Lag, add initial logic for shared FDB net/mlx5: Return mdev from eswitch IB/mlx5: Rename is_apu_thread_cq function to is_apu_cq net/mlx5: Add DCS caps & fields support ==================== Link: https://lore.kernel.org/r/20210809202522.316930-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-08devlink: Simplify devlink port API callsLeon Romanovsky
Devlink port already has pointer to the devlink instance and all API calls that forward these devlink ports to the drivers perform same "devlink_port->devlink" assignment before actual call. This patch removes useless parameter and allows us in the future to create specific devlink_port_ops to manage user space access with reliable ops assignment. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net/mlx5: E-Switch, add logic to enable shared FDBMark Bloch
Shared FDB allows to direct traffic from all the vports in the HCA to a single eswitch. In order to do that three things are needed. 1) Point the ingress ACL of the slave uplink to that of the master. With this, wire traffic from both uplinks will reach the same eswitch with the same metadata where a single steering rule can catch traffic from both ports. 2) Set the FDB root flow table of the slave's eswitch to that of the master. As this flow table can change dynamically make sure to sync it on any set root flow table FDB command. This will make sure traffic from SFs, VFs, ECPFs and PFs reach the master eswitch. 3) Split wire traffic at the eswitch manager egress ACL so that it's directed to the native eswitch manager. We only treat wire traffic from both ports the same at the eswitch level. If such traffic wasn't handled in the eswitch it needs to reach the right representor to be processed by software. For example LACP packets should *always* reach the right uplink representor for correct operation. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-05net/mlx5: Lag, properly lock eswitch if neededMark Bloch
Currently when doing hardware lag we check the eswitch mode but as this isn't done under a lock the check isn't valid. As the code needs to sync between two different devices an extra care is needed. - When going to change eswitch mode, if hardware lag is active destroy it. - While changing eswitch modes block any hardware bond creation. - Delay handling bonding events until there are no mode changes in progress. - When attaching a new mdev to lag, block until there is no mode change in progress. In order for the mode change to finish the interface lock will have to be taken. Release the lock and sleep for 100ms to allow forward progress. As this is a very rare condition (can happen if the user unbinds and binds a PCI function while also changing eswitch mode of the other PCI function) it has no real world impact. As taking multiple eswitch mode locks is now required lockdep will complain about a possible deadlock. Register a key per eswitch to make lockdep happy. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-05net/mlx5e: Use shared mappings for restoring from metadataRoi Dayan
FTEs are added with mapped metadata which is saved per eswitch. When uplink reps are bonded and we are in a single FDB mode, we could fail to find metadata which was stored on one eswitch mapping but not the other or with a different id. To resolve this issue use shared mapping between eswitch ports. We do not have any conflict using a single mapping, for a type, between the ports. Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-07-27net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32Chris Mi
The offending refactor commit uses u16 chain wrongly. Actually, it should be u32. Fixes: c620b772152b ("net/mlx5: Refactor tc flow attributes structure") CC: Ariel Levkovich <lariel@nvidia.com> Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09net/mlx5: Bridge, add offload infrastructureVlad Buslov
Create new files bridge.{c|h} in en/rep directory that implement bridge interaction with representor netdevices and handle required events/notifications, bridge.{c|h} in esw directory that implement all necessary eswitch offloading infrastructure and works on vport/eswitch level. Provide new kconfig MLX5_BRIDGE which is automatically selected when both kernel bridge and mlx5 eswitch configs are enabled. Provide basic infrastructure for bridge offloads: - struct mlx5_esw_bridge_offloads - per-eswitch bridge offload structure that encapsulates generic bridge-offloads data (notifier blocks, ingress flow table/group, etc.) that is created/deleted on enable/disable eswitch offloads. - struct mlx5_esw_bridge - per-bridge structure that encapsulates per-bridge data (reference counter, FDB, egress flow table/group, etc.) that is created when first eswitch represetor is attached to new bridge and deleted when last representor is removed from the bridge as a result of NETDEV_CHANGEUPPER event. The bridge tables are created with new priority FDB_BR_OFFLOAD in FDB namespace. The new priority is between tc-miss and slow path priorities. Priority consist of two levels: the ingress table that is global per eswitch and matches incoming packets by src_mac/vid and redirects them to next level (egress table) that is chosen according to ingress port bridge membership and matches on dst_mac/vid in order to redirect packet to vport according to the following diagram: + | +---------v----------+ | | | FDB_TC_OFFLOAD | | | +---------+----------+ | | +---------v----------+ | | | FDB_FT_OFFLOAD | | | +---------+----------+ | | +---------v----------+ | | | FDB_TC_MISS | | | +---------+----------+ | +--------------------------------------+ | | | | +------+ | | | | | +------v--------+ FDB_BR_OFFLOAD | | | INGRESS_TABLE | | | +------+---+----+ | | | | match | | | +---------+ | | | | | +-------+ | | +-------v-------+ match | | | | | | EGRESS_TABLE +------------> vport | | | +-------+-------+ | | | | | | | +-------+ | | miss | | | +------+------+ | | | | +--------------------------------------+ | | +---------v----------+ | | | FDB_SLOW_PATH | | | +---------+----------+ | v Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-09net/mlx5: Create TC-miss priority and tableVlad Buslov
In order to adhere to kernel software datapath model bridge offloads must come after TC and NF FDBs. Following patches in this series add new FDB priority for bridge after FDB_FT_OFFLOAD. However, since netfilter offload is implemented with unmanaged tables, its miss path is not automatically connected to next priority and requires the code to manually connect with slow table. To keep bridge offloads encapsulated and not mix it with eswitch offloads, create a new FDB_TC_MISS priority between FDB_FT_OFFLOAD and FDB_SLOW_PATH: + | +---------v----------+ | | | FDB_TC_OFFLOAD | | | +---------+----------+ | | | +---------v----------+ | | | FDB_FT_OFFLOAD | | | +---------+----------+ | | | +---------v----------+ | | | FDB_TC_MISS | | | +---------+----------+ | | | +---------v----------+ | | | FDB_SLOW_PATH | | | +---------+----------+ | v Initialize the new priority with single default empty managed table and use the table as TC/NF miss patch instead of slow table. This approach allows bridge offloads to be created as new FDB namespace priority between FDB_TC_MISS and FDB_SLOW_PATH without exposing its internal tables to any other modules since miss path of managed TC-miss table is automatically wired to next priority. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24net/mlx5: SF, Extend SF table for additional SF id rangeParav Pandit
Extended the SF table to cover additioanl SF id range of external controller. A user optionallly provides the external controller number when user wants to create SF on the external controller. An example on eswitch system: $ devlink dev eswitch set pci/0033:01:00.0 mode switchdev $ devlink port show pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false function: hw_addr 00:00:00:00:00:00 $ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1 pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 external true splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24net/mlx5: E-Switch, Consider SF ports of host PFParav Pandit
Query SF vports count and base id of host PF from the firmware. Account these ports in the total port calculation whenever it is non zero. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24net/mlx5: E-Switch, Use xarray for vport number to vport and rep mappingParav Pandit
Currently vport number to vport and its representor are mapped using an array and an index. Vport numbers of different types of functions are not contiguous. Adding new such discontiguous range using index and number mapping is increasingly complex and hard to maintain. Hence, maintain an xarray of vport and rep whose lookup is done based on the vport number. Each VF and SF entry is marked with a xarray mark to identify the function type. Additionally PF and VF needs special handling for legacy inline mode. They are additionally marked as host function using additional HOST_FN mark. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-24net/mlx5: E-Switch, Prepare to return total vports from eswitch structParav Pandit
Total vports are already stored during eswitch initialization. Instead of calculating everytime, read directly from eswitch. Additionally, host PF's SF vport information is available using QUERY_HCA_CAP command. It is not available through HCA_CAP of the eswitch manager PF. Hence, this patch prepares the return total eswitch vport count from the existing eswitch struct. This further helps to keep eswitch port counting macros and logic within eswitch. Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14net/mlx5: E-Switch, Move legacy code to a individual fileParav Pandit
Currently eswitch offers two modes. Legacy and offloads. Offloads code is already in its own file eswitch_offloads.c However eswitch.c contains the eswitch legacy code and common infrastructure code. To enable future extensions and to better manage generic common eswitch infrastructure code, move the legacy code to its own legacy.c file. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14net/mlx5: E-Switch, Convert a macro to a helper routineParav Pandit
Convert ESW_ALLOWED macro to a helper routine so that it can be used in other eswitch files. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14net/mlx5: E-Switch, Make vport number u16Parav Pandit
Vport number is 16-bit field in hardware. Make it u16. Move location of vport in the structure so that it reduces a hole in the structure. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-04-14net/mlx5: E-Switch, let user to enable disable metadataParav Pandit
Currently each packet inserted in eswitch is tagged with a internal metadata to indicate source vport. Metadata tagging is not always needed. Metadata insertion is needed for multi-port RoCE, failover between representors and stacked devices. In many other cases, metadata enablement is not needed. Metadata insertion slows down the packet processing rate of the E-switch when it is in switchdev mode. Below table show performance gain with metadata disabled for VXLAN offload rules in both SMFS and DMFS steering mode on ConnectX-5 device. ---------------------------------------------- | steering | metadata | pkt size | rx pps | | mode | | | (million) | ---------------------------------------------- | smfs | disabled | 128Bytes | 42 | ---------------------------------------------- | smfs | enabled | 128Bytes | 36 | ---------------------------------------------- | dmfs | disabled | 128Bytes | 42 | ---------------------------------------------- | dmfs | enabled | 128Bytes | 36 | ---------------------------------------------- Hence, allow user to disable metadata using driver specific devlink parameter. Metadata setting of the eswitch is applicable only for the switchdev mode. Example to show and disable metadata before changing eswitch mode: $ devlink dev param show pci/0000:06:00.0 name esw_port_metadata pci/0000:06:00.0: name esw_port_metadata type driver-specific values: cmode runtime value true $ devlink dev param set pci/0000:06:00.0 \ name esw_port_metadata value false cmode runtime $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> --- changelog: v1->v2: - added performance numbers in commit log - updated commit log and documentation for switchdev mode - added explicit note on when user can disable metadata in documentation