Age | Commit message | Author |
|
Add mutex_destroy() call in driver initialization error flow.
Fixes: 6882b0aee180f ("mlxsw: Introduce support for I2C bus")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Link: https://lore.kernel.org/r/20220407070703.2421076-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"The sprinkling of SPI drivers is because we added a new one and Mark
sent us a SPI driver interface conversion pull request.
Core
----
- Introduce XDP multi-buffer support, allowing the use of XDP with
jumbo frame MTUs and combination with Rx coalescing offloads (LRO).
- Speed up netns dismantling (5x) and lower the memory cost a little.
Remove unnecessary per-netns sockets. Scope some lists to a netns.
Cut down RCU syncing. Use batch methods. Allow netdev registration
to complete out of order.
- Support distinguishing timestamp types (ingress vs egress) and
maintaining them across packet scrubbing points (e.g. redirect).
- Continue the work of annotating packet drop reasons throughout the
stack.
- Switch netdev error counters from an atomic to dynamically
allocated per-CPU counters.
- Rework a few preempt_disable(), local_irq_save() and busy waiting
sections problematic on PREEMPT_RT.
- Extend the ref_tracker to allow catching use-after-free bugs.
BPF
---
- Introduce "packing allocator" for BPF JIT images. JITed code is
marked read only, and used to be allocated at page granularity.
Custom allocator allows for more efficient memory use, lower iTLB
pressure and prevents identity mapping huge pages from getting
split.
- Make use of BTF type annotations (e.g. __user, __percpu) to enforce
the correct probe read access method, add appropriate helpers.
- Convert the BPF preload to use light skeleton and drop the
user-mode-driver dependency.
- Allow XDP BPF_PROG_RUN test infra to send real packets, enabling
its use as a packet generator.
- Allow local storage memory to be allocated with GFP_KERNEL if
called from a hook allowed to sleep.
- Introduce fprobe (multi kprobe) to speed up mass attachment (arch
bits to come later).
- Add unstable conntrack lookup helpers for BPF by using the BPF
kfunc infra.
- Allow cgroup BPF progs to return custom errors to user space.
- Add support for AF_UNIX iterator batching.
- Allow iterator programs to use sleepable helpers.
- Support JIT of add, and, or, xor and xchg atomic ops on arm64.
- Add BTFGen support to bpftool which allows to use CO-RE in kernels
without BTF info.
- Large number of libbpf API improvements, cleanups and deprecations.
Protocols
---------
- Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev.
- Adjust TSO packet sizes based on min_rtt, allowing very low latency
links (data centers) to always send full-sized TSO super-frames.
- Make IPv6 flow label changes (AKA hash rethink) more configurable,
via sysctl and setsockopt. Distinguish between server and client
behavior.
- VxLAN support to "collect metadata" devices to terminate only
configured VNIs. This is similar to VLAN filtering in the bridge.
- Support inserting IPv6 IOAM information to a fraction of frames.
- Add protocol attribute to IP addresses to allow identifying where
given address comes from (kernel-generated, DHCP etc.)
- Support setting socket and IPv6 options via cmsg on ping6 sockets.
- Reject mis-use of ECN bits in IP headers as part of DSCP/TOS.
Define dscp_t and stop taking ECN bits into account in fib-rules.
- Add support for locked bridge ports (for 802.1X).
- tun: support NAPI for packets received from batched XDP buffs,
doubling the performance in some scenarios.
- IPv6 extension header handling in Open vSwitch.
- Support IPv6 control message load balancing in bonding, prevent
neighbor solicitation and advertisement from using the wrong port.
Support NS/NA monitor selection similar to existing ARP monitor.
- SMC
- improve performance with TCP_CORK and sendfile()
- support auto-corking
- support TCP_NODELAY
- MCTP (Management Component Transport Protocol)
- add user space tag control interface
- I2C binding driver (as specified by DMTF DSP0237)
- Multi-BSSID beacon handling in AP mode for WiFi.
- Bluetooth:
- handle MSFT Monitor Device Event
- add MGMT Adv Monitor Device Found/Lost events
- Multi-Path TCP:
- add support for the SO_SNDTIMEO socket option
- lots of selftest cleanups and improvements
- Increase the max PDU size in CAN ISOTP to 64 kB.
Driver API
----------
- Add HW counters for SW netdevs, a mechanism for devices which
offload packet forwarding to report packet statistics back to
software interfaces such as tunnels.
- Select the default NIC queue count as a fraction of number of
physical CPU cores, instead of hard-coding to 8.
- Expose devlink instance locks to drivers. Allow device layer of
drivers to use that lock directly instead of creating their own
which always runs into ordering issues in devlink callbacks.
- Add header/data split indication to guide user space enabling of
TCP zero-copy Rx.
- Allow configuring completion queue event size.
- Refactor page_pool to enable fragmenting after allocation.
- Add allocation and page reuse statistics to page_pool.
- Improve Multiple Spanning Trees support in the bridge to allow
reuse of topologies across VLANs, saving HW resources in switches.
- DSA (Distributed Switch Architecture):
- replay and offload of host VLAN entries
- offload of static and local FDB entries on LAG interfaces
- FDB isolation and unicast filtering
New hardware / drivers
----------------------
- Ethernet:
- LAN937x T1 PHYs
- Davicom DM9051 SPI NIC driver
- Realtek RTL8367S, RTL8367RB-VB switch and MDIO
- Microchip ksz8563 switches
- Netronome NFP3800 SmartNICs
- Fungible SmartNICs
- MediaTek MT8195 switches
- WiFi:
- mt76: MediaTek mt7916
- mt76: MediaTek mt7921u USB adapters
- brcmfmac: Broadcom BCM43454/6
- Mobile:
- iosm: Intel M.2 7360 WWAN card
Drivers
-------
- Convert many drivers to the new phylink API built for split PCS
designs but also simplifying other cases.
- Intel Ethernet NICs:
- add TTY for GNSS module for E810T device
- improve AF_XDP performance
- GTP-C and GTP-U filter offload
- QinQ VLAN support
- Mellanox Ethernet NICs (mlx5):
- support xdp->data_meta
- multi-buffer XDP
- offload tc push_eth and pop_eth actions
- Netronome Ethernet NICs (nfp):
- flow-independent tc action hardware offload (police / meter)
- AF_XDP
- Other Ethernet NICs:
- at803x: fiber and SFP support
- xgmac: mdio: preamble suppression and custom MDC frequencies
- r8169: enable ASPM L1.2 if system vendor flags it as safe
- macb/gem: ZynqMP SGMII
- hns3: add TX push mode
- dpaa2-eth: software TSO
- lan743x: multi-queue, mdio, SGMII, PTP
- axienet: NAPI and GRO support
- Mellanox Ethernet switches (mlxsw):
- source and dest IP address rewrites
- RJ45 ports
- Marvell Ethernet switches (prestera):
- basic routing offload
- multi-chain TC ACL offload
- NXP embedded Ethernet switches (ocelot & felix):
- PTP over UDP with the ocelot-8021q DSA tagging protocol
- basic QoS classification on Felix DSA switch using dcbnl
- port mirroring for ocelot switches
- Microchip high-speed industrial Ethernet (sparx5):
- offloading of bridge port flooding flags
- PTP Hardware Clock
- Other embedded switches:
- lan966x: PTP Hardware Clock
- qca8k: mdio read/write operations via crafted Ethernet packets
- Qualcomm 802.11ax WiFi (ath11k):
- add LDPC FEC type and 802.11ax High Efficiency data in radiotap
- enable RX PPDU stats in monitor co-exist mode
- Intel WiFi (iwlwifi):
- UHB TAS enablement via BIOS
- band disablement via BIOS
- channel switch offload
- 32 Rx AMPDU sessions in newer devices
- MediaTek WiFi (mt76):
- background radar detection
- thermal management improvements on mt7915
- SAR support for more mt76 platforms
- MBSSID and 6 GHz band on mt7915
- RealTek WiFi:
- rtw89: AP mode
- rtw89: 160 MHz channels and 6 GHz band
- rtw89: hardware scan
- Bluetooth:
- mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS)
- Microchip CAN (mcp251xfd):
- multiple RX-FIFOs and runtime configurable RX/TX rings
- internal PLL, runtime PM handling simplification
- improve chip detection and error handling after wakeup"
* tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits)
llc: fix netdevice reference leaks in llc_ui_bind()
drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool
ice: don't allow to run ice_send_event_to_aux() in atomic ctx
ice: fix 'scheduling while atomic' on aux critical err interrupt
net/sched: fix incorrect vlan_push_eth dest field
net: bridge: mst: Restrict info size queries to bridge ports
net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init()
drivers: net: xgene: Fix regression in CRC stripping
net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT
net: dsa: fix missing host-filtered multicast addresses
net/mlx5e: Fix build warning, detected write beyond size of field
iwlwifi: mvm: Don't fail if PPAG isn't supported
selftests/bpf: Fix kprobe_multi test.
Revert "rethook: x86: Add rethook x86 implementation"
Revert "arm64: rethook: Add arm64 rethook implementation"
Revert "powerpc: Add rethook support"
Revert "ARM: rethook: Add rethook arm implementation"
netdevice: add missing dm_private kdoc
net: bridge: mst: prevent NULL deref in br_mst_info_size()
selftests: forwarding: Use same VRF for port and VLAN upper
...
|
|
Pull VFIO updates from Alex Williamson:
- Introduce new device migration uAPI and implement device specific
mlx5 vfio-pci variant driver supporting new protocol (Jason
Gunthorpe, Yishai Hadas, Leon Romanovsky)
- New HiSilicon acc vfio-pci variant driver, also supporting migration
interface (Shameer Kolothum, Longfang Liu)
- D3hot fixes for vfio-pci-core (Abhishek Sahu)
- Document new vfio-pci variant driver acceptance criteria
(Alex Williamson)
- Fix UML build unresolved ioport_{un}map() functions
(Alex Williamson)
- Fix MAINTAINERS due to header movement (Lukas Bulwahn)
* tag 'vfio-v5.18-rc1' of https://github.com/awilliam/linux-vfio: (31 commits)
vfio-pci: Provide reviewers and acceptance criteria for variant drivers
MAINTAINERS: adjust entry for header movement in hisilicon qm driver
hisi_acc_vfio_pci: Use its own PCI reset_done error handler
hisi_acc_vfio_pci: Add support for VFIO live migration
crypto: hisilicon/qm: Set the VF QM state register
hisi_acc_vfio_pci: Add helper to retrieve the struct pci_driver
hisi_acc_vfio_pci: Restrict access to VF dev BAR2 migration region
hisi_acc_vfio_pci: add new vfio_pci driver for HiSilicon ACC devices
hisi_acc_qm: Move VF PCI device IDs to common header
crypto: hisilicon/qm: Move few definitions to common header
crypto: hisilicon/qm: Move the QM header to include/linux
vfio/mlx5: Fix to not use 0 as NULL pointer
PCI/IOV: Fix wrong kernel-doc identifier
vfio/mlx5: Use its own PCI reset_done error handler
vfio/pci: Expose vfio_pci_core_aer_err_detected()
vfio/mlx5: Implement vfio_pci driver for mlx5 devices
vfio/mlx5: Expose migration commands over mlx5 device
vfio: Remove migration protocol v1 documentation
vfio: Extend the device migration protocol with RUNNING_P2P
vfio: Define device migration protocol v2
...
|
|
When merged with Linus' tree, the patch cited below will cause the
following build warning:
In function 'fortify_memset_chk',
inlined from 'mlx5e_xmit_xdp_frame' at drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:438:3:
include/linux/fortify-string.h:242:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix that by grouping the fields to be memset in struct_group() to avoid
the false alarm.
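For reference, a minimal sketch of the struct_group() pattern (struct and
field names here are illustrative, not the actual mlx5e layout):

	struct example_wqe {
		u32 opcode;
		struct_group(cleared,	/* fields memset as one unit */
			u64 addr;
			u32 byte_count;
			u32 lkey;
		);
	};

	/* FORTIFY_SOURCE now sees a single field of the right size */
	memset(&wqe->cleared, 0, sizeof(wqe->cleared));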
Fixes: 9ded70fa1d81 ("net/mlx5e: Don't prefill WQEs in XDP SQ in the multi buffer mode")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20220322172224.31849-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Make the devlink core hold the instance lock during eswitch_mode
callbacks. Cheat in case of mlx5 (see the cover letter).
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is no function mlx5e_get_sq(), remove the declaration.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Moshe Tal <moshet@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
|
|
Starting from commit
4cab346bcf74 ("net/mlx5: No command allowed when command interface is not ready"),
there are no callers of mlx5_cmd_trigger_completions() outside of cmd.c
anymore. Make it a static function.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
After introducing multi-buffer XDP_TX, the MLX5E_XDP_TX_DS_COUNT define
became misleading: it's no longer the DS count of an XDP_TX WQE, since
such a WQE can be longer because of fragments.
As this define is only used at one place in mlx5e_open_xdpsq(), it's
also not very useful anymore. This commit removes the define and puts
the calculation of ds_count for prefilled single-fragment WQEs inline.
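As a sketch, the inline calculation for a prefilled single-fragment WQE
presumably comes down to (exact expression assumed from the commit text):

	/* ctrl + eth segments of the WQE, plus one data segment */
	unsigned int ds_count =
		sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS + 1;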
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Now that legacy RQ implements XDP in the non-linear mode, stop blocking
this configuration. Allow non-linear mode only for programs aware of
multi buffer.
XDP performance with linear mode RQ hasn't changed.
Baseline (MTU 1500, TX MPWQE, legacy RQ, single core):
60-byte packets, XDP_DROP: 11.25 Mpps
60-byte packets, XDP_TX: 9.0 Mpps
60-byte packets, XDP_PASS: 668 kpps
Multi buffer (MTU 9000, TX MPWQE, legacy RQ, single core):
60-byte packets, XDP_DROP: 10.1 Mpps
60-byte packets, XDP_TX: 6.6 Mpps
60-byte packets, XDP_PASS: 658 kpps
8900-byte packets, XDP_DROP: 769 kpps (100% of sent packets)
8900-byte packets, XDP_TX: 674 kpps (100% of sent packets)
8900-byte packets, XDP_PASS: 637 kpps
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
This commit enables passing multi buffer XDP frames to the TX handlers
on XDP_TX. Fragments are DMA synchronized to the device and queued to
the xdpi_fifo for a subsequent unmapping.
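A hedged sketch of the per-fragment sync described above (local variable
names assumed):

	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
	int i;

	for (i = 0; i < sinfo->nr_frags; i++) {
		skb_frag_t *frag = &sinfo->frags[i];
		dma_addr_t addr;

		addr = page_pool_get_dma_addr(skb_frag_page(frag)) +
		       skb_frag_off(frag);
		dma_sync_single_for_device(sq->pdev, addr,
					   skb_frag_size(frag), DMA_TO_DEVICE);
	}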
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The next commit will add more indentation levels to mlx5e_xmit_xdp_buff.
To keep indentation minimal, unindent the else-block of the if-statement
by doing an early return.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
xmit_xdp_frame is extended to support sending fragmented XDP frames. The
next commit will start using this functionality.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
When MPWQE is disabled, mlx5e_open_xdpsq() prefills the common fields of
WQEs in the XDP SQ to save time when sending packets.
mlx5e_xmit_xdp_frame() relies on these prefilled fields; however, sending
multi buffer XDP frames would require changing some of them on a
per-packet basis. Besides that, mlx5e_xmit_xdp_frame() will be used as a
fallback to send multi buffer XDP frames when MPWQE is enabled (MPWQE
can only handle linear packets).
In order to prepare for XDP multi buffer support, this commit introduces
a mode for mlx5e_xmit_xdp_frame() that fills all the fields itself.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
When MPWQE is disabled, mlx5e_open_xdpsq prefills the common fields of
WQEs in the XDP SQ to save time when sending packets. One such field
is eseg->inline_hdr.sz, which can be either 0 or MLX5E_XDP_MIN_INLINE,
depending on the inline mode of the SQ.
The inline mode can't change during the lifetime of the SQ, so setting
this field again in mlx5e_xmit_xdp_frame is redundant. Moreover, the
xmit function only sets it to MLX5E_XDP_MIN_INLINE, but not to 0 in the
other case.
This commit removes the redundant assignment in mlx5e_xmit_xdp_frame.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The implementations of xmit_xdp_frame get the xdpi parameter of type
struct mlx5e_xdp_info for the sole purpose of calling
mlx5e_xdpi_fifo_push() on success.
This commit moves this call outside of xmit_xdp_frame, shifting this
responsibility to the caller. It will allow more fine-grained handling
of XDP info for cases when an xdp_frame is fragmented.
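A hedged sketch of the resulting caller-side pattern (the extra arguments
to xmit_xdp_frame are placeholders):

	/* xmit_xdp_frame() no longer takes the xdpi; push it on success */
	if (sq->xmit_xdp_frame(sq, &xdptxd, NULL, 0))
		mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, &xdpi);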
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Use page_pool_set_dma_addr() to store the DMA address of a page inside
struct page, in order to avoid passing struct mlx5e_dma_info to XDP
handlers. Previously, struct mlx5e_dma_info was used to pass both the
DMA address and the page, and it worked well for the single-fragment
case.
When XDP multi buffer is in use, and a fragmented xdp_frame has to be
transmitted, the driver needs to know the DMA addresses of fragments,
however, the array of fragments in struct skb_shared_info doesn't
contain them. In order to pass the DMA addresses, the driver puts them
into struct page itself, which is accessible from the array of fragments
in struct skb_shared_info. The existing XDP handlers are modified to
remove the dependency on struct mlx5e_dma_info.
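The page_pool helpers involved already exist; a minimal sketch of the
idea:

	/* at mapping time, remember the DMA address inside struct page */
	page_pool_set_dma_addr(page, dma_addr);

	/* in the XDP_TX path, recover it from the fragment's page */
	addr = page_pool_get_dma_addr(skb_frag_page(frag));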
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
This commit adds XDP multi buffer support to the RX path in the
non-linear legacy RQ mode. mlx5e_xdp_handle is called from
mlx5e_skb_from_cqe_nonlinear.
XDP_TX action for fragmented XDP frames is not yet supported and
blocked.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The implementation of XDP in mlx5e assumes that the frame size is equal
to the page size. Force this limitation in the non-linear mode for XDP
multi buffer.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
XDP multi buffer implementation in the kernel assumes that all fragments
have the same size. bpf_xdp_frags_increase_tail uses this assumption to
get the size of the last fragment, and __xdp_build_skb_from_frame uses
it to calculate truesize as nr_frags * xdpf->frame_sz.
The current implementation of mlx5e uses fragments of different size in
non-linear legacy RQ. Specifically, the last fragment can be larger than
the others. It's an optimization for packets smaller than MTU.
This commit adapts mlx5e to the kernel limitations and makes it use
fragments of the same size, in order to add support for XDP multi
buffer. The change is applied only if XDP is active, otherwise the old
optimization still applies.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
mlx5e_skb_from_cqe_nonlinear creates an xdp_buff first, putting the
first fragment as the linear part, and the rest of fragments as
fragments to struct skb_shared_info in the tailroom. Then it creates an
SKB in place, based on the xdp_buff. The XDP program is not called in
this commit yet.
This commit contains no functional change, except the SKB is built over
the whole frag_stride of the first fragment, instead of the minimal size
required (headroom, data and skb_shared_info).
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The mlx5_fill_page_array() API function is unused.
Remove it to reduce the number of exported functions.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
All WQ types have moved to the fragmented allocation API for coherent
memory; the contiguous API is no longer used.
Remove it to reduce the number of exported functions.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
On a tuple offload del command, the driver tries to remove tuples from
the hashtable twice: once directly via mlx5_tc_ct_entry_remove_from_tuples(),
and a second time in the following mlx5_tc_ct_entry_put()->
mlx5_tc_ct_entry_del()->mlx5_tc_ct_entry_remove_from_tuples() call.
This doesn't cause any issue, since rhashtable first checks whether the
object being removed exists in the hashtable.
Remove the extra mlx5_tc_ct_entry_remove_from_tuples() call.
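For reference, the double removal is harmless because
rhashtable_remove_fast() looks the object up first and returns -ENOENT
if it is already gone; illustrative call (hashtable and node names
assumed):

	err = rhashtable_remove_fast(&ct_priv->ct_tuples_ht,
				     &entry->tuple_node, tuples_ht_params);
	/* err == -ENOENT on the second, redundant removal */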
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
It can be calculated via mlx5dr_ste_get_hw_ste(), which is simple and
lightweight, so there is no need for a dedicated member.
This reduces struct mlx5dr_ste by 8 bytes; its size is now 48 bytes.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Shun Hao <shunh@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Remove chunk_size from struct mlx5dr_icm_chunk and use
chunk->size instead.
Remove ste_arr/hw_ste_arr/miss_list, since they can be accessed via the
htbl->chunk pointer; there is no need to keep a copy.
This reduces struct mlx5dr_ste_htbl by 28 bytes; its size is now
32 bytes.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Shun Hao <shunh@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The goal is to reduce memory consumption at a large scale of flow rules.
Both values can be calculated quickly from the buddy memory pool:
1. num_of_entries is obtained via dr_icm_pool_get_chunk_num_of_entries().
2. byte_size is obtained via dr_icm_pool_get_chunk_byte_size().
Use the chunk size in dr_icm_chunk to speed this up; the copy in
dr_ste_htbl will be removed in an upcoming commit.
This reduces struct mlx5_dr_icm_chunk by 8 bytes; its current size is
56 bytes.
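A hedged sketch of one of the helpers (the wrapper shape is assumed; the
size-to-entries conversion already exists in the dr code):

	static unsigned int
	dr_icm_pool_get_chunk_num_of_entries(struct mlx5dr_icm_chunk *chunk)
	{
		return mlx5dr_icm_pool_chunk_size_to_entries(chunk->size);
	}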
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Shun Hao <shunh@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
It can be calculated quickly from the buddy memory pool via
mlx5dr_icm_pool_get_chunk_icm_addr(), which is lightweight and
straightforward.
This reduces struct mlx5_dr_icm_chunk by 8 bytes; its current size is
64 bytes.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Shun Hao <shunh@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Reduce the memory footprint by removing mr_addr and rkey from
mlx5_dr_icm_chunk:
1. mr_addr is calculated via mlx5dr_icm_pool_get_chunk_mr_addr().
2. rkey is calculated via mlx5dr_icm_pool_get_chunk_rkey().
The two new functions are lightweight and straightforward.
This reduces struct mlx5_dr_icm_chunk by 8 bytes; its current size is
72 bytes.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Shun Hao <shunh@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
According to profiling, mlx5dr_ste and mlx5dr_icm_chunk are the two
hot structures. Their memory layout can be optimized by reordering
their members.
Struct mlx5dr_ste shrinks from 64 bytes to 56 bytes.
In the upcoming commits, the struct mlx5dr_icm_chunk memory layout
will change anyway as members are removed, so it is left untouched
here.
Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Reviewed-by: Shun Hao <shunh@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The packet size in mlx5e_skb_from_cqe_mpwrq_linear can't overflow u16,
since the maximum packet size in linear striding RQ is 2^13 bytes. Drop
the unneeded u32 variable.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The len parameter of mlx5e_xdp_handle is used to output the new packet
length after XDP has processed the packet and returned XDP_PASS.
However, this value can be calculated on the caller site, as the caller
knows if it was an XDP_PASS.
This commit drops the len parameter and moves the calculation to the
caller, reducing the number of parameters passed to the function and
preparing for XDP support in non-linear legacy RQ.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Instead of early return inside mlx5e_xdp_handle(), let the caller check
if an XDP program is loaded. This allows saving a few unnecessary
function calls and calculations in case !prog.
Performance test: single core, drop packets in iptables
Before: 3,872,504 pps
After: 3,975,628 pps (+2.66%)
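Together with the len-parameter change above, the caller-side pattern is
presumably (argument list abridged):

	prog = rcu_dereference(rq->xdp_prog);
	if (prog && mlx5e_xdp_handle(rq, di, prog, &xdp))
		return NULL; /* packet consumed by XDP */
	/* XDP_PASS: derive the new length after a possible head adjustment */
	cqe_bcnt = xdp.data_end - xdp.data;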
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
As a performance optimization and preparation to enabling XDP multi
buffer on non-linear legacy RQ, build the linear part of the SKB over
the first fragment, instead of allocating a new buffer and copying the
first 256 bytes there.
To achieve this, add headroom and tailroom to the first fragment.
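A hedged sketch of the new construction (local names hypothetical):

	/* the first fragment now has headroom and tailroom, so the SKB can
	 * be built directly on top of it, with no copy
	 */
	void *va = page_address(frag_page) + frag_offset;
	struct sk_buff *skb = build_skb(va, frag_stride);

	if (unlikely(!skb))
		return NULL;
	skb_reserve(skb, headroom);
	skb_put(skb, first_frag_len);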
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Currently, rq->buff.headroom is applied to all fragments in legacy RQ.
In the linear mode, there is a non-zero headroom, but there is only one
fragment per packet. In the non-linear mode, the headroom is zero.
This commit changes the logic to apply the headroom only to the first
fragment. The current behavior remains the same for both linear and
non-linear modes. However, it allows the next commit to enable headroom
for the non-linear mode, which will be applied only to the first
fragment.
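The per-fragment logic then boils down to something like (sketch):

	/* only the first fragment of a packet gets the headroom */
	u16 headroom = i == 0 ? rq->buff.headroom : 0;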
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
mlx5e_build_rq_frags_info() assumes that MTU is not bigger than
PAGE_SIZE * MLX5E_MAX_RX_FRAGS, which is 16K for 4K pages. Currently,
the firmware limits MTU to 10K, so the assumption doesn't lead to a bug.
This commit adds an additional driver check for reliability, since the
firmware boundary might change.
The calculation is moved to a separate function with a comment
explaining it. It's a preparation for the following patches that
introduce XDP multi buffer support.
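A hedged sketch of the added check (names and the exact accounting are
assumptions):

	/* MLX5E_MAX_RX_FRAGS pages must cover headroom + MTU + shinfo */
	int max_mtu = MLX5E_MAX_RX_FRAGS * PAGE_SIZE - headroom -
		      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));

	if (params->sw_mtu > max_mtu)
		return -EINVAL;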
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Currently, the MPLSoUDP encap offload does the L2 pop implicitly, while
adding such an action explicitly (vlan pop_eth) causes the rule to not
be offloaded.
Solve it by adding offload support for vlan pop_eth in the MPLSoUDP
encap case.
Flow example:
filter root protocol ip pref 1 flower chain 0
filter root protocol ip pref 1 flower chain 0 handle 0x1
eth_type ipv4
dst_ip 2.2.2.22
src_ip 2.2.2.21
in_hw in_hw_count 1
action order 1: vlan pop_eth pipe
index 1 ref 1 bind 1
used_hw_stats delayed
action order 2: mpls push protocol mpls_uc label 555 tc 3 ttl 255 pipe
index 1 ref 1 bind 1
used_hw_stats delayed
action order 3: tunnel_key set
src_ip 8.8.8.21
dst_ip 8.8.8.22
dst_port 6635
csum
tos 0x4
ttl 6 pipe
index 1 ref 1 bind 1
used_hw_stats delayed
action order 4: mirred (Egress Redirect to device bareudp0) stolen
index 1 ref 1 bind 1
used_hw_stats delayed
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently, action pedit of the source and destination MACs is used to
fill the MACs in the L2 push step of MPLSoUDP decap offload. This isn't
aligned with tc software, which uses the vlan push_eth action to do
this.
To fix that, add offload support for the vlan push_eth action together
with the mpls pop action, and deprecate the use of pedit on the MACs.
Flow example:
filter protocol mpls_uc pref 1 flower chain 0
filter protocol mpls_uc pref 1 flower chain 0 handle 0x1
eth_type 8847
mpls_label 555
enc_dst_port 6635
in_hw in_hw_count 1
action order 1: tunnel_key unset pipe
index 2 ref 1 bind 1
used_hw_stats delayed
action order 2: mpls pop protocol ip pipe
index 2 ref 1 bind 1
used_hw_stats delayed
action order 3: vlan push_eth dst_mac de:a2:ec:d6:69:c8 src_mac de:a2:ec:d6:69:c8 pipe
index 2 ref 1 bind 1
used_hw_stats delayed
action order 4: mirred (Egress Redirect to device enp8s0f0_0) stolen
index 2 ref 1 bind 1
used_hw_stats delayed
Signed-off-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that devlink ports are protected by the instance lock
it seems natural to pass devlink_port as an argument to
the port_split / port_unsplit callbacks.
This should save the drivers from doing a lookup.
In theory drivers may have supported unsplitting ports
which were not registered prior to this change.
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Let the core take the devlink instance lock around port splitting
and remove the now redundant locking in the drivers.
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Explicitly lock the devlink instance and use devl_ API.
This will be used by the subsequent patch to invoke
.port_split / .port_unsplit callbacks with devlink
instance lock held.
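A hedged sketch of the pattern (devl_lock()/devl_unlock() are the new
API; the callee name here is illustrative):

	devl_lock(devlink);
	err = devl_port_register(devlink, devlink_port, port_index);
	devl_unlock(devlink);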
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
1) Revert CHECKSUM_UNNECESSARY for UDP packet from conntrack.
2) Reject unsupported families when creating tables, from Phil Sutter.
3) GRE support for the flowtable, from Toshiaki Makita.
4) Add GRE offload support for act_ct, also from Toshiaki.
5) Update mlx5 driver to support for GRE flowtable offload,
from Toshiaki Makita.
6) Oneliner to clean up incorrect indentation in nf_conntrack_bridge,
from Jiapeng Chong.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: bridge: clean up some inconsistent indenting
net/mlx5: Support GRE conntrack offload
act_ct: Support GRE offload
netfilter: flowtable: Support GRE
netfilter: nf_tables: Reject tables of unsupported family
Revert "netfilter: conntrack: mark UDP zero checksum as CHECKSUM_UNNECESSARY"
====================
Link: https://lore.kernel.org/r/20220315091513.66544-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We need to sync page pool stats only for active channels. Reading
ethtool stats on a down netdev, or on a netdev with a modified number of
channels, will result in a use-after-free when trying to access page
pools that have already been freed.
BUG: KASAN: use-after-free in mlx5e_stats_grp_sw_update_stats+0x465/0xf80
Read of size 8 at addr ffff888004835e40 by task ethtool/720
Fixes: cc10e84b2ec3 ("mlx5: add support for page_pool_get_stats")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reported-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Joe Damato <jdamato@fastly.com>
Link: https://lore.kernel.org/r/20220312005353.786255-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use kzalloc instead of kmalloc + memset.
The semantic patch that makes this change is:
(https://coccinelle.gitlabpages.inria.fr/website/)
//<smpl>
@@
expression res, size, flag;
@@
- res = kmalloc(size, flag);
+ res = kzalloc(size, flag);
...
- memset(res, 0, size);
//</smpl>
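In plain C, the transformation amounts to (illustrative):

	/* before */
	res = kmalloc(size, GFP_KERNEL);
	if (!res)
		return -ENOMEM;
	memset(res, 0, size);

	/* after */
	res = kzalloc(size, GFP_KERNEL);
	if (!res)
		return -ENOMEM;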
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220312102705.71413-3-Julia.Lawall@inria.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Unlike the legacy EEPROM callbacks, when using the netlink EEPROM query
(get_module_eeprom_by_page) the driver should not try to validate the
query parameters, but simply perform the read requested by user space.
Recent discussion in the mailing list:
https://lore.kernel.org/netdev/20220120093051.70845141@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net/
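A hedged sketch of what the netlink-path callback then looks like (the
mlx5 query plumbing is abridged and partly assumed):

	static int
	mlx5e_get_module_eeprom_by_page(struct net_device *netdev,
					const struct ethtool_module_eeprom *page_data,
					struct netlink_ext_ack *extack)
	{
		struct mlx5e_priv *priv = netdev_priv(netdev);
		struct mlx5_module_eeprom_query_params query = {
			.offset		= page_data->offset,
			.size		= page_data->length,
			.page		= page_data->page,
			.bank		= page_data->bank,
			.i2c_address	= page_data->i2c_address,
		};

		/* no range checks here; forward the read as user space
		 * requested it and let the firmware reject invalid access
		 */
		return mlx5_query_module_eeprom_by_page(priv->mdev, &query,
							page_data->data);
	}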
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The assumption that the first byte in the module mapping dword is the
module number shouldn't be hard-coded in the driver, but come from
mlx5_ifc structs.
While at it, fix the incorrect width for the 'rx_lane' and 'tx_lane'
fields.
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The MCIA register supports either 12 or 32 dwords, use the correct value
by querying the capability from the MCAM register.
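A hedged sketch of the capability-driven choice (the MCAM bit name is an
assumption):

	/* data size, in dwords, supported by the MCIA register */
	max_dwords = MLX5_CAP_MCAM_FEATURE(mdev, mcia_32dwords) ? 32 : 12;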
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
SMFS dr matchers are processed sequentially in hardware according to
their priorities, and are not skipped if empty.
Currently, smfs ct fs creates four predefined dr matchers per ct
table (ct/ct nat) with hardcoded priorities. Compared to dmfs ct fs,
which uses autogroups, this might cause additional hops in the fastpath
for traffic patterns that match later priorities, even if the previous
priorities are empty; e.g., a user sending only IPv6 UDP traffic will
incur 3 additional hops.
Create the matchers dynamically, using the highest priority available,
on first rule usage, and remove them on last usage.
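A hedged sketch of the get/put scheme (names assumed, locking elided):

	/* first rule of a given tuple type creates the dr matcher */
	if (!refcount_inc_not_zero(&smfs_matcher->ref)) {
		smfs_matcher->dr_matcher =
			mlx5_smfs_matcher_create(tbl, next_prio, spec);
		refcount_set(&smfs_matcher->ref, 1);
	}

	/* the last rule removal destroys it, freeing the priority */
	if (refcount_dec_and_test(&smfs_matcher->ref))
		mlx5_smfs_matcher_destroy(smfs_matcher->dr_matcher);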
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
The fs_core layer adds extra bookkeeping that is either unneeded for CT,
or unused by the underlying software steering, such as allocating FTEs
and FTE ids, saving the match key and mask, and autogroups management.
On top of that, direct steering has a translation layer (fs_dr) from PRM
commands to direct steering objects, for example, creating temporary
dr_action objects. This has a performance impact when dealing
with the high insertion rates of CT.
To use direct steering (smfs) directly for ct, add a tc ct fs smfs
implementation. Instead of dmfs autogroups, smfs ct fs uses one of 4
predefined dr matchers in the CT and CT-NAT tables, for each combination
of tuple ethertype (ipv4/ipv6) and tuple ip_proto (udp/tcp) that
is currently used by nf flow table flow offload.
At rule insertion, validate that the flow rule fits one of the
predefined matchers, and insert into it.
To fill the dr_actions of the rule efficiently, create the fwd to
post_ct tbl dr_action at fs init, the count dr_action at counter
creation, and re-use the already pre-allocated modify header dr_action.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add a thin layer that exports selected parts of the direct steering (dr)
API, which will be used by a ct fs implementation in a following
patch.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
If sw steering was used to create the table, dr steering fs creates
a backing dr table for the mlx5 flow table.
Add a helper to return this table, so it can be used to create matchers
and add rules on it directly, instead of passing via
eswitch_offloads/fs_core insertion.
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|