summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
AgeCommit message (Collapse)Author
2023-05-22net/mlx5e: Fix SQ wake logic in ptp napi_poll contextRahul Rameshbabu
Check in the mlx5e_ptp_poll_ts_cq context if the ptp tx sq should be woken up. Before change, the ptp tx sq may never wake up if the ptp tx ts skb fifo is full when mlx5e_poll_tx_cq checks if the queue should be woken up. Fixes: 1880bc4e4a96 ("net/mlx5e: Add TX port timestamp support") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-19net/mlx5e: XDP, Add support for multi-buffer XDP redirect-inTariq Toukan
Handle multi-buffer XDP redirect-in requests coming through mlx5e_xdp_xmit. Extend struct mlx5e_xmit_data_frags with an additional dma_arr field, to point to the fragments dma mapping, as they cannot be retrieved via the page_pool_get_dma_addr() function. Push a dma_addr xdpi instance per each fragment, and use them in the completion flow to dma_unmap the frags. Finally, remove the restriction in mlx5e_open_xdpsq, and set the flag in xdp_features. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: Introduce extended version for mlx5e_xmit_dataTariq Toukan
Introduce struct mlx5e_xmit_data_frags to be used for non-linear xmit buffers. Let it include sinfo pointer. Take one bit from the len field to indicate if the descriptor has fragments and can be casted-up into the extended version. Zero-init to make sure has_frags, and potentially future fields, are zero when not explicitly assigned. Another field will be added in a downstream patch to indicate and point to dma addresses of the different frags, for redirect-in requests. This simplifies the mlx5e_xmit_xdp_frame/mlx5e_xmit_xdp_frame_mpwqe functions params. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-04-19net/mlx5e: Move struct mlx5e_xmit_data to datapath headerTariq Toukan
Move TX datapath struct from the generic en.h to the datapath txrx.h header, where it belongs. Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-03-28net/mlx5e: RX, Defer page release in striding rq for better recyclingDragos Tatulea
Currently, for striding RQ, fragmented pages from the page pool can get released in two ways: 1) In the mlx5e driver when trimming off the unused fragments AND the associated skb fragments have been released. This path allows recycling of pages to the page pool cache (allow_direct == true). 2) On the skb release path (last fragment release), which will always release pages to the page pool ring (allow_direct == false). Whichever is releasing the last fragment will be decisive on where the page gets released: the cache or the ring. So we obviously want to maximize for doing the release from 1. This patch does that by deferring the release of page fragments right before requesting new ones from the page pool. Extra care needs to be taken for the corner cases: * On first call, make sure that release is not called. The skip_release_bitmap is used for this purpose. * On rq shutdown, make sure that all wqes that were not in the linked list are released. For a single ring, single core, default MTU (1500) TCP stream test the number of pages allocated from the cache directly (rx_pp_recycle_cached) increases from 31 % to 98 %: +----------------------------------------------+ | Page Pool stats (/sec) | Before | After | +-------------------------+---------+----------+ |rx_pp_alloc_fast | 2137754 | 2261033 | |rx_pp_alloc_slow | 47 | 9 | |rx_pp_alloc_empty | 47 | 9 | |rx_pp_alloc_refill | 23230 | 819 | |rx_pp_alloc_waive | 0 | 0 | |rx_pp_recycle_cached | 672182 | 2209015 | |rx_pp_recycle_cache_full | 1789 | 0 | |rx_pp_recycle_ring | 1485848 | 52259 | |rx_pp_recycle_ring_full | 3003 | 584 | +----------------------------------------------+ With this patch, the performance in striding rq for the above test is back to baseline. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28net/mlx5e: RX, Enable skb page recycling through the page_poolDragos Tatulea
Start using the page_pool skb recycling api to recycle all pages back to the page pool and stop using atomic page reference counting. The mlx5e driver used to manage in-flight pages using page refcounting: for each fragment there were 2 atomic write operations happening (one for building the skb and one on skb release). The page_pool api introduced a method to track page fragments more optimally: * The page's pp_fragment_count is set to a large bias on page alloc (1 x atomic write operation). * The driver tracks the actual page fragments in a non atomic variable. * When the skb is recycled, pp_fragment_count is decremented (atomic write operation). * When page is released in the driver, the unused number of fragments (relative to the bias) is deducted from pp_fragment_count (atomic write operation). * Last page defragmentation will only be an atomic read. So in total there are `number of fragments + 1` atomic write ops. As opposed to previously: `2 * frags` atomic writes ops. Pages are wrapped in a mlx5e_frag_page structure which also contains the number of fragments. This makes it easy to count the fragments in the driver. This change brings performance improvements for the case when the old rx page_cache had low recycling rates due to head of queue blocking. For a iperf3 TCP test with a single stream, on a single core (iperf and receive queue running on same core), the following improvements can be noticed: * Striding rq: - before (net-next baseline): bitrate = 30.1 Gbits/sec - after : bitrate = 31.4 Gbits/sec (diff: 4.14 %) * Legacy rq: - before (net-next baseline): bitrate = 30.2 Gbits/sec - after : bitrate = 33.0 Gbits/sec (diff: 8.48 %) There are 2 temporary performance degradations introduced: 1) TCP streams that had a good recycling rate with the old page_cache have a degradation for both striding and linear rq. This is due to very low page pool cache recycling: the pages are released during skb recycle which will release pages to the page pool ring for safety. The following patches in this series will tackle this problem by deferring the page release in the driver to increase the chance of having pages recycled to the cache. 2) XDP performance is now lower (4-5 %) due to the higher number of atomic operations used for fragment management. But this opens the door for supporting multiple packets per page in XDP, which will bring a big gain. Otherwise, performance is similar to baseline. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28net/mlx5e: RX, Enable dma map and sync from page_pool allocatorDragos Tatulea
Remove driver dma mapping and unmapping of pages. Let the page_pool api do it. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-28net/mlx5e: RX, Remove alloc unit layout constraint for striding rqDragos Tatulea
This change removes the usage of mlx5e_alloc_unit union for striding rq. The change is more straightforward than legacy rq as the alloc units union is already in place. This patch only moves things around: instead of an array of unions make it a union of arrays. This has the effect that each mlx5e_mpw_info will allocate the largest possible size of the array member. For xsk this means that the array of xdp_buff pointers for the wqe will still be contiguous, but there will be some extra unused bytes at the end of the array. As further patch in the series will add the mlx5e_frag_page struct for which the described size constraint will no longer hold. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-15net/mlx5e: Utilize the entire fifoRahul Rameshbabu
Previous check was comparing against the fifo mask. The mask is size of the fifo (power of two) minus one, so a less than or equal comparator should be used for checking if the fifo has room for the SKB. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230314054234.267365-6-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-24mlx5: fix possible ptp queue fifo use-after-freeVadim Fedorenko
Fifo indexes are not checked during pop operations and it leads to potential use-after-free when poping from empty queue. Such case was possible during re-sync action. WARN_ON_ONCE covers future cases. There were out-of-order cqe spotted which lead to drain of the queue and use-after-free because of lack of fifo pointers check. Special check and counter are added to avoid resync operation if SKB could not exist in the fifo because of OOO cqe (skb_id must be between consumer and producer index). Fixes: 58a518948f60 ("net/mlx5e: Add resiliency for PTP TX port timestamp") Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-24mlx5: fix skb leak while fifo resync and pushVadim Fedorenko
During ptp resync operation SKBs were poped from the fifo but were never freed neither by napi_consume nor by dev_kfree_skb_any. Add call to napi_consume_skb to properly free SKBs. Another leak was happening because mlx5e_skb_fifo_has_room() had an error in the check. Comparing free running counters works well unless C promotes the types to something wider than the counter. In this case counters are u16 but the result of the substraction is promouted to int and it causes wrong result (negative value) of the check when producer have already overlapped but consumer haven't yet. Explicit cast to u16 fixes the issue. Fixes: 58a518948f60 ("net/mlx5e: Add resiliency for PTP TX port timestamp") Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Vadim Fedorenko <vadfed@meta.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-18net/mlx5e: Remove unused function mlx5e_sq_xmit_simpleTariq Toukan
The last usage was removed as part of commit 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support"). Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-01-28Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2023-01-28 We've added 124 non-merge commits during the last 22 day(s) which contain a total of 124 files changed, 6386 insertions(+), 1827 deletions(-). The main changes are: 1) Implement XDP hints via kfuncs with initial support for RX hash and timestamp metadata kfuncs, from Stanislav Fomichev and Toke Høiland-Jørgensen. Measurements on overhead: https://lore.kernel.org/bpf/875yellcx6.fsf@toke.dk 2) Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case, from Andrii Nakryiko. 3) Significantly reduce the search time for module symbols by livepatch and BPF, from Jiri Olsa and Zhen Lei. 4) Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals, from David Vernet. 5) Fix several issues in the dynptr processing such as stack slot liveness propagation, missing checks for PTR_TO_STACK variable offset, etc, from Kumar Kartikeya Dwivedi. 6) Various performance improvements, fixes, and introduction of more than just one XDP program to XSK selftests, from Magnus Karlsson. 7) Big batch to BPF samples to reduce deprecated functionality, from Daniel T. Lee. 8) Enable struct_ops programs to be sleepable in verifier, from David Vernet. 9) Reduce pr_warn() noise on BTF mismatches when they are expected under the CONFIG_MODULE_ALLOW_BTF_MISMATCH config anyway, from Connor O'Brien. 10) Describe modulo and division by zero behavior of the BPF runtime in BPF's instruction specification document, from Dave Thaler. 11) Several improvements to libbpf API documentation in libbpf.h, from Grant Seltzer. 12) Improve resolve_btfids header dependencies related to subcmd and add proper support for HOSTCC, from Ian Rogers. 13) Add ipip6 and ip6ip decapsulation support for bpf_skb_adjust_room() helper along with BPF selftests, from Ziyang Xuan. 14) Simplify the parsing logic of structure parameters for BPF trampoline in the x86-64 JIT compiler, from Pu Lehui. 15) Get BTF working for kernels with CONFIG_RUST enabled by excluding Rust compilation units with pahole, from Martin Rodriguez Reboredo. 16) Get bpf_setsockopt() working for kTLS on top of TCP sockets, from Kui-Feng Lee. 17) Disable stack protection for BPF objects in bpftool given BPF backends don't support it, from Holger Hoffstätte. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (124 commits) selftest/bpf: Make crashes more debuggable in test_progs libbpf: Add documentation to map pinning API functions libbpf: Fix malformed documentation formatting selftests/bpf: Properly enable hwtstamp in xdp_hw_metadata selftests/bpf: Calls bpf_setsockopt() on a ktls enabled socket. bpf: Check the protocol of a sock to agree the calls to bpf_setsockopt(). bpf/selftests: Verify struct_ops prog sleepable behavior bpf: Pass const struct bpf_prog * to .check_member libbpf: Support sleepable struct_ops.s section bpf: Allow BPF_PROG_TYPE_STRUCT_OPS programs to be sleepable selftests/bpf: Fix vmtest static compilation error tools/resolve_btfids: Alter how HOSTCC is forced tools/resolve_btfids: Install subcmd headers bpf/docs: Document the nocast aliasing behavior of ___init bpf/docs: Document how nested trusted fields may be defined bpf/docs: Document cpumask kfuncs in a new file selftests/bpf: Add selftest suite for cpumask kfuncs selftests/bpf: Add nested trust selftests suite bpf: Enable cpumasks to be queried and used as kptrs bpf: Disallow NULLable pointers for trusted kfuncs ... ==================== Link: https://lore.kernel.org/r/20230128004827.21371-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-23net/mlx5e: Support RX XDP metadataToke Høiland-Jørgensen
Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe pointer to the mlx5e_skb_from* functions so it can be retrieved from the XDP ctx to do this. Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20230119221536.3349901-17-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-18net/mlx5e: Suppress Send WQEBB room warning for PAGE_SIZE >= 16KBRahul Rameshbabu
Send WQEBB size is 64 bytes and the max number of WQEBBs for an SQ is 255. For 16KB pages and greater, there is always sufficient spaces for all WQEBBs of an SQ. Cast mlx5e_get_max_sq_wqebbs(mdev) to u16. Prevents -Wtautological-constant-out-of-range-compare warnings from occurring when PAGE_SIZE >= 16KB. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-11-09net/mlx5e: Add missing sanity checks for max TX WQE sizeMaxim Mikityanskiy
The commit cited below started using the firmware capability for the maximum TX WQE size. This commit adds an important check to verify that the driver doesn't attempt to exceed this capability, and also restores another check mistakenly removed in the cited commit (a WQE must not exceed the page size). Fixes: c27bd1718c06 ("net/mlx5e: Read max WQEBBs on the SQ from firmware") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-27net/mlx5e: Extend SKB room check to include PTP-SQAya Levin
When tx_port_ts is set, the driver diverts all UPD traffic over PTP port to a dedicated PTP-SQ. The SKBs are cached until the wire-CQE arrives. When the packet size is greater then MTU, the firmware might drop it and the packet won't be transmitted to the wire, hence the wire-CQE won't reach the driver. In this case the SKBs are accumulated in the SKB fifo. Add room check to consider the PTP-SQ SKB fifo, when the SKB fifo is full, driver stops the queue resulting in a TX timeout. Devlink TX-reporter can recover from it. Fixes: 1880bc4e4a96 ("net/mlx5e: Add TX port timestamp support") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20221026135153.154807-5-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-01net/mlx5e: xsk: Use xsk_buff_alloc_batch on striding RQMaxim Mikityanskiy
XSK provides a function to allocate frames in batches for more efficient processing. This commit starts using this function on striding RQ and creates an optimized flow for XSK. A side effect is an opportunity to optimize the regular RX flow by dropping branching for XSK cases. Performance improvement is up to 6.4% in the aligned mode and up to 7.5% in the unaligned mode. Aligned mode, 2048-byte frames: 12.9 Mpps -> 13.8 Mpps Aligned mode, 4096-byte frames: 11.8 Mpps -> 12.5 Mpps Unaligned mode, 2048-byte frames: 11.9 Mpps -> 12.8 Mpps Unaligned mode, 3072-byte frames: 11.4 Mpps -> 12.1 Mpps Unaligned mode, 4096-byte frames: 11.0 Mpps -> 11.2 Mpps CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28net/mlx5e: kTLS, Check ICOSQ WQE size in advanceMaxim Mikityanskiy
Instead of WARNing in runtime when TLS offload WQEs posted to ICOSQ are over the hardware limit, check their size before enabling TLS RX offload, and block the offload if the condition fails. It also allows to drop a u16 field from struct mlx5e_icosq. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-28net/mlx5e: Use the aligned max TX MPWQE sizeMaxim Mikityanskiy
TX MPWQE size is limited to the cacheline-aligned maximum. Use the same value for the stop room and the capability check. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-18net/mlx5e: Store DMA address inside struct pageMaxim Mikityanskiy
Use page_pool_set_dma_addr() to store the DMA address of a page inside struct page, in order to avoid passing struct mlx5e_dma_info to XDP handlers. Previously, struct mlx5e_dma_info was used to pass both the DMA address and the page, and it worked well for the single-fragment case. When XDP multi buffer is in use, and a fragmented xdp_frame has to be transmitted, the driver needs to know the DMA addresses of fragments, however, the array of fragments in struct skb_shared_info doesn't contain them. In order to pass the DMA addresses, the driver puts them into struct page itself, which is accessible from the array of fragments in struct skb_shared_info. The existing XDP handlers are modified to remove the dependency on struct mlx5e_dma_info. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-14net/mlx5e: Move mlx5e_select_queue to en/selq.cMaxim Mikityanskiy
This commit moves mlx5e_select_queue and all stuff related to ndo_select_queue to en/selq.c to put all stuff working with selq into a separate file. Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-14net/mlx5e: Use FW limitation for max MPW WQEBBsAya Levin
Calculate maximal count of MPW WQEBBs on SQ's creation and store it there. Remove MLX5E_TX_MPW_MAX_NUM_DS and MLX5E_TX_MPW_MAX_WQEBBS. Update mlx5e_tx_mpwqe_is_full() and mlx5e_xdp_mpqwe_is_full() . Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-14net/mlx5e: Read max WQEBBs on the SQ from firmwareAya Levin
Prior to this patch the maximal value for max WQEBBs (WQE Basic Blocks, where WQE is a Work Queue Element) on the TX side was assumed to be 16 (fixed value). All firmware versions till today comply to this. In order to be more flexible and resilient, read from FW the corresponding: max_wqe_sz_sq. This value describes the maximum WQE size given in bytes, thus max WQEBBs is given by the division in WQEBB's byte size. The driver uses the top between 16 and the division result. This ensures synchronization between driver and firmware and avoids unexpected behavior. Store this value on the different SQs (Send Queues) for easy access. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-02-01net/mlx5e: Fix wrong calculation of header index in HW_GROKhalid Manaa
The HW doesn't wrap the CQE.shampo.header_index field according to the headers buffer size, instead it always increases it until reaching overflow of u16 size. Thus the mlx5e_handle_rx_cqe_mpwrq_shampo handler should mask the CQE header_index field to find the actual header index in the headers buffer. Fixes: f97d5c2a453e ("net/mlx5e: Add handle SHAMPO cqe support") Signed-off-by: Khalid Manaa <khalidm@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-26net/mlx5e: Add data path for SHAMPO featureBen Ben-Ishay
The header buffer is used to store the headers of the rx packets. The header buffer size deduced from WorkQueue size + restriction of max packets per WorkQueueElement. This commit adds the functionality for posting/updating memory for the header buffer during the posting/updating of WQEs. Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-31net/mlx5e: Guarantee room for XSK wakeup NOP on async ICOSQTariq Toukan
XSK wakeup flow triggers an IRQ by posting a NOP WQE and hitting the doorbell on the async ICOSQ. It maintains its state so that it doesn't issue another NOP WQE if it has an outstanding one already. For this flow to work properly, the NOP post must not fail. Make sure to reserve room for the NOP WQE in all WQE posts to the async ICOSQ. Fixes: 8d94b590f1e4 ("net/mlx5e: Turn XSK ICOSQ into a general asynchronous one") Fixes: 1182f3659357 ("net/mlx5e: kTLS, Add kTLS RX HW offload support") Fixes: 0419d8c9d8f8 ("net/mlx5e: kTLS, Add kTLS RX resync support") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-16net/mlx5: Add cyc2time HW translation mode supportAya Levin
Device timestamp can be in real time mode (cycles to time translation is offloaded into the Hardware). With real time mode, HW provides timestamp which is already translated into nanoseconds. With this mode, driver adjusts both the HW and timecounter (to keep clock_info_page updated) using callbacks: adjfreq, adjtime and settime. HW clock modifications are done via MTUTC access reg commands. Driver is allowed to modify HW real time clock only if MCAM ptpcyc2realtime_modify capability is set. Add MTUTC set function to be used for configuring the HW real time clock. Modify existing code to support both internal timer (with conversion via timecounter_cyc2time() and real time (no conversions). Align the signatures of the helpers converting from timestamp to nanoseconds. With that, when allocating a queue assign the corresponding callback with respect to the capability. Adjust 1PPS timestamp calculation flows based on the timestamp mode. Cyc2time offload brings two major advantages: - Improve MTAE (Max Time Absolute Error) for HW TS by up to 160 ns over a 100% loaded CPU. - Faster data-path timestamp to nanoseconds, as translation is lock-less and done in HW. On real time mode, timestamp format is 32 high bits of seconds and 32 low bits of nanoseconds. On some flows, driver shall convert this format into nanoseconds wall-clock with REAL_TIME_TO_NS macro. HW supports a single clock, and it is shared by all functions on a device. In case real time clock is used, it is recommended to use a single GM to all device's functions. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-07net/mlx5e: Fix SWP offsets when vlan inserted by driverMoshe Shemesh
In case WQE includes inline header the vlan is inserted by driver even if vlan offload is set. On geneve over vlan interface where software parser is used the SWP offsets should be updated according to the added vlan. Fixes: e3cfc7e6b7bd ("net/mlx5e: TX, Add geneve tunnel stateless offload support") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-12-08net/mlx5e: Move MLX5E_RX_ERR_CQE macroEran Ben Elisha
MLX5E_RX_ERR_CQE Macro is used only in data-path, move it to the appropriate header file. Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-12-08net/mlx5e: Change skb fifo push/pop API to be used without SQEran Ben Elisha
The skb fifo push/pop API used pre-defined attributes within the mlx5e_txqsq. In order to share the skb fifo API with other non-SQ use cases, change the API input to get newly defined mlx5e_skb_fifo struct. Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-12-08net/mlx5e: Allow CQ outside of channel contextAya Levin
In order to be able to create a CQ outside of a channel context, remove cq->channel direct pointer. This requires adding a direct pointer to channel statistics, netdevice, priv and to mlx5_core in order to support CQs that are a part of mlx5e_channel. In addition, parameters the were previously derived from the channel like napi, NUMA node, channel stats and index are now assembled in struct mlx5e_create_cq_param which is given to mlx5e_open_cq() instead of channel pointer. Generalizing mlx5e_open_cq() allows opening CQ outside of channel context which will be used in following patches in the patch-set. Signed-off-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-22Merge tag 'mlx5-updates-2020-09-21' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2020-09-21 Multi packet TX descriptor support for SKBs. This series introduces some refactoring of the regular TX data path in mlx5 and adds the Enhanced TX MPWQE feature support. MPWQE stands for multi-packet work queue element, and it can serve multiple packets, reducing the PCI bandwidth spent on control traffic. It should improve performance in scenarios where PCI is the bottleneck, and xmit_more is signaled by the kernel. The refactoring done in this series also improves the packet rate on its own. MPWQE is already implemented in the XDP tx path, this series adds the support of MPWQE for regular kernel SKB tx path. MPWQE is supported from ConnectX-5 and onward, for legacy devices we need to keep backward compatibility for regular (Single packet) WQE descriptor. MPWQE is not compatible with certain offloads and features, such as TLS offload, TSO, nonlinear SKBs. If such incompatible features are in use, the driver gracefully falls back to non-MPWQE per SKB. Prior to the final patch "net/mlx5e: Enhanced TX MPWQE for SKBs" that adds the actual support, Maxim did some refactoring to the tx data path to split it into stages and smaller helper functions that can be utilized and reused for both legacy and new MPWQE feature. Performance testing: UDP performance is improved in a single stream pktgen test: Packet rate: 16.86 Mpps (±0.15 Mpps) -> 20.94 Mpps (±0.33 Mpps) Instructions per packet: 434 -> 329 Cycles per packet: 158 -> 123 Instructions per cycle: 2.75 -> 2.67 TCP and XDP_TX single stream tests show no performance difference. MPWQE can reduce PCI bandwidth: PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores: Inbound PCI utilization with MPWQE off: 80.3% Inbound PCI utilization with MPWQE on: 59.0% PCI Gen3, pktgen at fixed rate of 56064000 pps on 24 CPU cores: Inbound PCI utilization with MPWQE off: 65.4% Inbound PCI utilization with MPWQE on: 49.3% MPWQE can also reduce CPU load, increasing the packet rate in case of CPU bottleneck: PCI Gen2, pktgen at full rate on 24 CPU cores: Packet rate with MPWQE off: 37.5 Mpps Packet rate with MPWQE on: 49.0 Mpps PCI Gen3, pktgen at full rate on 24 CPU cores: Packet rate with MPWQE off: 57.0 Mpps Packet rate with MPWQE on: 66.8 Mpps Burst size in all pktgen tests is 32. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64) NIC: Mellanox ConnectX-6 Dx GCC 10.2.0 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-21net/mlx5e: Enhanced TX MPWQE for SKBsMaxim Mikityanskiy
This commit adds support for Enhanced TX MPWQE feature in the regular (SKB) data path. A MPWQE (multi-packet work queue element) can serve multiple packets, reducing the PCI bandwidth on control traffic. Two new stats (tx*_mpwqe_blks and tx*_mpwqe_pkts) are added. The feature is on by default and controlled by the skb_tx_mpwqe private flag. In a MPWQE, eseg is shared among all packets, so eseg-based offloads (IPSEC, GENEVE, checksum) run on a separate eseg that is compared to the eseg of the current MPWQE session to decide if the new packet can be added to the same session. MPWQE is not compatible with certain offloads and features, such as TLS offload, TSO, nonlinear SKBs. If such incompatible features are in use, the driver gracefully falls back to non-MPWQE. This change has no performance impact in TCP single stream test and XDP_TX single stream test. UDP pktgen, 64-byte packets, single stream, MPWQE off: Packet rate: 16.96 Mpps (±0.12 Mpps) -> 17.01 Mpps (±0.20 Mpps) Instructions per packet: 421 -> 429 Cycles per packet: 156 -> 161 Instructions per cycle: 2.70 -> 2.67 UDP pktgen, 64-byte packets, single stream, MPWQE on: Packet rate: 16.96 Mpps (±0.12 Mpps) -> 20.94 Mpps (±0.33 Mpps) Instructions per packet: 421 -> 329 Cycles per packet: 156 -> 123 Instructions per cycle: 2.70 -> 2.67 Enabling MPWQE can reduce PCI bandwidth: PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores: Inbound PCI utilization with MPWQE off: 80.3% Inbound PCI utilization with MPWQE on: 59.0% PCI Gen3, pktgen at fixed rate of 56064000 pps on 24 CPU cores: Inbound PCI utilization with MPWQE off: 65.4% Inbound PCI utilization with MPWQE on: 49.3% Enabling MPWQE can also reduce CPU load, increasing the packet rate in case of CPU bottleneck: PCI Gen2, pktgen at full rate on 24 CPU cores: Packet rate with MPWQE off: 37.5 Mpps Packet rate with MPWQE on: 49.0 Mpps PCI Gen3, pktgen at full rate on 24 CPU cores: Packet rate with MPWQE off: 57.0 Mpps Packet rate with MPWQE on: 66.8 Mpps Burst size in all pktgen tests is 32. CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64) NIC: Mellanox ConnectX-6 Dx GCC 10.2.0 Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Rename xmit-related structs to generalize themMaxim Mikityanskiy
As preparation for the upcoming TX MPWQE support for SKBs, rename struct mlx5e_xdp_mpwqe to mlx5e_tx_mpwqe and move it above struct mlx5e_txqsq. This structure will be reused in the regular SQ and in the regular TX data path. Also rename mlx5e_xdp_xmit_data to mlx5e_xmit_data - it will be used in the upcoming TX MPWQE flow. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Generalize TX MPWQE checks for full sessionMaxim Mikityanskiy
As preparation for the upcoming TX MPWQE for SKBs, create a function (mlx5e_tx_mpwqe_is_full) to check whether an MPWQE session is full. This function will be shared by MPWQE code for XDP and for SKBs. Defines are renamed and moved to make them not XDP-specific. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Support multiple SKBs in a TX WQEMaxim Mikityanskiy
TX MPWQE support for SKBs is coming in one of the following patches, and a single MPWQE can send multiple SKBs. This commit prepares the TX path code to handle such cases: 1. An additional FIFO for SKBs is added, just like the FIFO for DMA chunks. 2. struct mlx5e_tx_wqe_info will contain num_fifo_pkts. If a given WQE contains only one packet, num_fifo_pkts will be zero, and the SKB will be stored in mlx5e_tx_wqe_info, as usual. If num_fifo_pkts > 0, the SKB pointer will be NULL, and the SKBs will be stored in the FIFO. This change has no performance impact in TCP single stream test and XDP_TX single stream test. When compiled with a recent GCC, this change shows no visible performance impact on UDP pktgen (burst 32) single stream test either: Packet rate: 16.95 Mpps (±0.15 Mpps) -> 16.96 Mpps (±0.12 Mpps) Instructions per packet: 429 -> 421 Cycles per packet: 160 -> 156 Instructions per cycle: 2.69 -> 2.70 CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64) NIC: Mellanox ConnectX-6 Dx GCC 10.2.0 Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNTMaxim Mikityanskiy
A constant for the number of DS in an empty WQE (i.e. a WQE without data segments) is needed in multiple places (normal TX data path, MPWQE in XDP), but currently we have a constant for XDP and an inline formula in normal TX. This patch introduces a common constant. Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct assignment, because the code nearby is touched. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Refactor xmit functionsMaxim Mikityanskiy
A huge function mlx5e_sq_xmit was split into several to achieve multiple goals: 1. Reuse the code in IPoIB. 2. Better intergrate with TLS, IPSEC, GENEVE and checksum offloads. Now it's possible to reserve space in the WQ before running eseg-based offloads, so: 2.1. It's not needed to copy cseg and eseg after mlx5e_fill_sq_frag_edge anymore. 2.2. mlx5e_txqsq_get_next_pi will be used instead of the legacy mlx5e_fill_sq_frag_edge for better code maintainability and reuse. 3. Prepare for the upcoming TX MPWQE for SKBs. It will intervene after mlx5e_sq_calc_wqe_attr to check if it's possible to use MPWQE, and the code flow will split into two paths: MPWQE and non-MPWQE. Two high-level functions are provided to send packets: * mlx5e_xmit is called by the networking stack, runs offloads and sends the packet. In one of the following patches, MPWQE support will be added to this flow. * mlx5e_sq_xmit_simple is called by the TLS offload, runs only the checksum offload and sends the packet. This change has no performance impact in TCP single stream test and XDP_TX single stream test. When compiled with a recent GCC, this change shows no visible performance impact on UDP pktgen (burst 32) single stream test either: Packet rate: 16.86 Mpps (±0.15 Mpps) -> 16.95 Mpps (±0.15 Mpps) Instructions per packet: 434 -> 429 Cycles per packet: 158 -> 160 Instructions per cycle: 2.75 -> 2.69 CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64) NIC: Mellanox ConnectX-6 Dx GCC 10.2.0 Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Move mlx5e_tx_wqe_inline_mode to en_tx.cMaxim Mikityanskiy
Move mlx5e_tx_wqe_inline_mode from en/txrx.h to en_tx.c as it's only used there. Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-09-21net/mlx5e: Fix multicast counter not up-to-date in "ip -s"Ron Diskin
Currently the FW does not generate events for counters other than error counters. Unlike ".get_ethtool_stats", ".ndo_get_stats64" (which ip -s uses) might run in atomic context, while the FW interface is non atomic. Thus, 'ip' is not allowed to issue FW commands, so it will only display cached counters in the driver. Add a SW counter (mcast_packets) in the driver to count rx multicast packets. The counter also counts broadcast packets, as we consider it a special case of multicast. Use the counter value when calling "ip -s"/"ifconfig". Fixes: f62b8bb8f2d3 ("net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet functionality") Signed-off-by: Ron Diskin <rondi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-28net/mlx5: Use fallthrough pseudo-keywordGustavo A. R. Silva
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-28net/mlx5e: Use indirect call wrappers for RX post WQEs functionsTariq Toukan
Use the indirect call wrapper API macros for declaration and scope of the RX post WQEs functions. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-28net/mlx5e: Move exposure of datapath function to txrx headerTariq Toukan
Move them from the generic header file "en.h", to the datapath header file "txrx.h". Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-02net/mlx5e: Add helper to get the RQ WQE counterAya Levin
Add a helper which retrieves the RQ's WQE counter. Use this helper in the RX reporter diagnose callback. $ devlink health diagnose pci/0000:00:0b.0 reporter rx Common config: RQ: type: 2 stride size: 2048 size: 8 CQ: stride size: 64 size: 1024 RQs: channel ix: 0 rqn: 2113 HW state: 1 SW state: 5 WQE counter: 7 posted WQEs: 7 cc: 7 ICOSQ HW state: 1 CQ: cqn: 1032 HW status: 0 channel ix: 1 rqn: 2118 HW state: 1 SW state: 5 WQE counter: 7 posted WQEs: 7 cc: 7 ICOSQ HW state: 1 CQ: cqn: 1036 HW status: 0 Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-02net/mlx5e: Add helper to get RQ WQE's headAya Levin
Add helper which retrieves the RQ WQE's head. Use this helper in RX reporter diagnose callback. Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-02net/mlx5e: Move RQ helpers to txrx.hAya Levin
Use txrx.h to contain helper function regarding TX/RX. In the coming patches, I will add more RQ helpers. Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-07-02net/mlx5e: Remove redundant RQ state queryAya Levin
When received a CQE error, the driver inspect the syndrome given by the firmware. RQ recovery is initiated only as a result of a fatal syndrome; syndrome which set the RQ into an error state. Hence no need to query the RQ state at the beginning of the recovery process. Add additional debug prints before recovering. Signed-off-by: Aya Levin <ayal@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-06-27net/mlx5e: kTLS, Add kTLS RX resync supportTariq Toukan
Implement the RX resync procedure, using the TLS async resync API. The HW offload of TLS decryption in RX side might get out-of-sync due to out-of-order reception of packets. This requires SW intervention to update the HW context and get it back in-sync. Performance: CPU: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz, 24 cores, HT off NIC: ConnectX-6 Dx 100GbE dual port Goodput (app-layer throughput) comparison: +---------------+-------+-------+---------+ | # connections | 1 | 4 | 8 | +---------------+-------+-------+---------+ | SW (Gbps) | 7.26 | 24.70 | 50.30 | +---------------+-------+-------+---------+ | HW (Gbps) | 18.50 | 64.30 | 92.90 | +---------------+-------+-------+---------+ | Speedup | 2.55x | 2.56x | 1.85x * | +---------------+-------+-------+---------+ * After linerate is reached, diff is observed in CPU util. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-06-27net/mlx5e: kTLS, Add kTLS RX HW offload supportTariq Toukan
Implement driver support for the kTLS RX HW offload feature. Resync support is added in a downstream patch. New offload contexts post their static/progress params WQEs over the per-channel async ICOSQ, protected under a spin-lock. The Channel/RQ is selected according to the socket's rxq index. Feature is OFF by default. Can be turned on by: $ ethtool -K <if> tls-hw-rx-offload on A new TLS-RX workqueue is used to allow asynchronous addition of steering rules, out of the NAPI context. It will be also used in a downstream patch in the resync procedure. Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>