summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
AgeCommit message (Collapse)Author
2025-05-22net/mlx5e: Convert mlx5 netdevs to instance lockingCosmin Ratiu
This patch convert mlx5 to use the new netdev instance lock in addition to the pre-existing state_lock (and the RTNL). mlx5e_priv.state_lock was already used throughout mlx5 to protect against concurrent state modifications on the same netdev, usually in addition to the RTNL. The new netdev instance lock will eventually replace it, but for now, it is acquired in addition to the existing locks in the order RTNL -> instance lock -> state_lock. All three netdev types handled by mlx5 are converted to the new style of locking, because they share a lot of code related to initializing channels and dealing with NAPI, so it's better to convert all three rather than introduce different assumptions deep in the call stack depending on the type of device. Because of the nature of the call graphs in mlx5, it wasn't possible to incrementally convert parts of the driver to use the new lock, since either all call paths into NAPI have to possess the new lock if the *_locked variants are used, or none of them can have the lock. One area which required extra care is the interaction between closing channels and devlink health reporter tasks. Previously, the recovery tasks were unconditionally acquiring the RTNL, which could lead to deadlocks in these scenarios: T1: mlx5e_close (== .ndo_stop(), has RTNL) -> mlx5e_close_locked -> mlx5e_close_channels -> mlx5e_ptp_close -> mlx5e_ptp_close_queues -> mlx5e_ptp_close_txqsqs -> mlx5e_ptp_close_txqsq -> cancel_work_sync(&ptpsq->report_unhealthy_work) waits for T2: mlx5e_ptpsq_unhealthy_work -> mlx5e_reporter_tx_ptpsq_unhealthy -> mlx5e_health_report -> devlink_health_report -> devlink_health_reporter_recover -> mlx5e_tx_reporter_ptpsq_unhealthy_recover which does: rtnl_lock(); => Deadlock. Another similar instance of this is: T1: mlx5e_close (== .ndo_stop(), has RTNL) -> mlx5e_close_locked -> mlx5e_close_channels -> mlx5e_ptp_close -> mlx5e_ptp_close_queues -> mlx5e_ptp_close_txqsqs -> mlx5e_ptp_close_txqsq -> cancel_work_sync(&sq->recover_work) waits for T2: mlx5e_tx_err_cqe_work -> mlx5e_reporter_tx_err_cqe -> mlx5e_health_report -> devlink_health_report -> devlink_health_reporter_recover -> mlx5e_tx_reporter_err_cqe_recover which does: rtnl_lock(); => Another deadlock. Fix that by using the same pattern previously done in mlx5e_tx_timeout_work, where the RTNL was repeatedly tried to be acquired until either: a) it is successfully acquired or b) there's no need for the work to be done any more (channel is being closed). Now, for all three recovery tasks, the instance lock is repeatedly tried to be acquired until successful or the channel/SQ is closed. As a side-effect, drop the !test_bit(MLX5E_STATE_OPENED, &priv->state) check from mlx5e_tx_timeout_work, it's weaker than !test_bit(MLX5E_STATE_CHANNELS_ACTIVE, &priv->state) and unnecessary. Future patches will introduce new call paths (from netdev queue management ops) which can close channels (and call cancel_work_sync on the recovery tasks) without the RTNL lock and only with the netdev instance lock. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1747829342-1018757-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5e: Disable MACsec offload for uplink representor profileCarolina Jubran
MACsec offload is not supported in switchdev mode for uplink representors. When switching to the uplink representor profile, the MACsec offload feature must be cleared from the netdevice's features. If left enabled, attempts to add offloads result in a null pointer dereference, as the uplink representor does not support MACsec offload even though the feature bit remains set. Clear NETIF_F_HW_MACSEC in mlx5e_fix_uplink_rep_features(). Kernel log: Oops: general protection fault, probably for non-canonical address 0xdffffc000000000f: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000078-0x000000000000007f] CPU: 29 UID: 0 PID: 4714 Comm: ip Not tainted 6.14.0-rc4_for_upstream_debug_2025_03_02_17_35 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__mutex_lock+0x128/0x1dd0 Code: d0 7c 08 84 d2 0f 85 ad 15 00 00 8b 35 91 5c fe 03 85 f6 75 29 49 8d 7e 60 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 a6 15 00 00 4d 3b 76 60 0f 85 fd 0b 00 00 65 ff RSP: 0018:ffff888147a4f160 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000078 RBP: ffff888147a4f2e0 R08: ffffffffa05d2c19 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: dffffc0000000000 R14: 0000000000000018 R15: ffff888152de0000 FS: 00007f855e27d800(0000) GS:ffff88881ee80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004e5768 CR3: 000000013ae7c005 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 Call Trace: <TASK> ? die_addr+0x3d/0xa0 ? exc_general_protection+0x144/0x220 ? asm_exc_general_protection+0x22/0x30 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core] ? __mutex_lock+0x128/0x1dd0 ? lockdep_set_lock_cmp_fn+0x190/0x190 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core] ? mutex_lock_io_nested+0x1ae0/0x1ae0 ? lock_acquire+0x1c2/0x530 ? macsec_upd_offload+0x145/0x380 ? lockdep_hardirqs_on_prepare+0x400/0x400 ? kasan_save_stack+0x30/0x40 ? kasan_save_stack+0x20/0x40 ? kasan_save_track+0x10/0x30 ? __kasan_kmalloc+0x77/0x90 ? __kmalloc_noprof+0x249/0x6b0 ? genl_family_rcv_msg_attrs_parse.constprop.0+0xb5/0x240 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core] mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core] ? mlx5e_macsec_add_rxsa+0x11a0/0x11a0 [mlx5_core] macsec_update_offload+0x26c/0x820 ? macsec_set_mac_address+0x4b0/0x4b0 ? lockdep_hardirqs_on_prepare+0x284/0x400 ? _raw_spin_unlock_irqrestore+0x47/0x50 macsec_upd_offload+0x2c8/0x380 ? macsec_update_offload+0x820/0x820 ? __nla_parse+0x22/0x30 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x15e/0x240 genl_family_rcv_msg_doit+0x1cc/0x2a0 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240 ? cap_capable+0xd4/0x330 genl_rcv_msg+0x3ea/0x670 ? genl_family_rcv_msg_dumpit+0x2a0/0x2a0 ? lockdep_set_lock_cmp_fn+0x190/0x190 ? macsec_update_offload+0x820/0x820 netlink_rcv_skb+0x12b/0x390 ? genl_family_rcv_msg_dumpit+0x2a0/0x2a0 ? netlink_ack+0xd80/0xd80 ? rwsem_down_read_slowpath+0xf90/0xf90 ? netlink_deliver_tap+0xcd/0xac0 ? netlink_deliver_tap+0x155/0xac0 ? _copy_from_iter+0x1bb/0x12c0 genl_rcv+0x24/0x40 netlink_unicast+0x440/0x700 ? netlink_attachskb+0x760/0x760 ? lock_acquire+0x1c2/0x530 ? __might_fault+0xbb/0x170 netlink_sendmsg+0x749/0xc10 ? netlink_unicast+0x700/0x700 ? __might_fault+0xbb/0x170 ? netlink_unicast+0x700/0x700 __sock_sendmsg+0xc5/0x190 ____sys_sendmsg+0x53f/0x760 ? import_iovec+0x7/0x10 ? kernel_sendmsg+0x30/0x30 ? __copy_msghdr+0x3c0/0x3c0 ? filter_irq_stacks+0x90/0x90 ? stack_depot_save_flags+0x28/0xa30 ___sys_sendmsg+0xeb/0x170 ? kasan_save_stack+0x30/0x40 ? copy_msghdr_from_user+0x110/0x110 ? do_syscall_64+0x6d/0x140 ? lock_acquire+0x1c2/0x530 ? __virt_addr_valid+0x116/0x3b0 ? __virt_addr_valid+0x1da/0x3b0 ? lock_downgrade+0x680/0x680 ? __delete_object+0x21/0x50 __sys_sendmsg+0xf7/0x180 ? __sys_sendmsg_sock+0x20/0x20 ? kmem_cache_free+0x14c/0x4e0 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f855e113367 Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 RSP: 002b:00007ffd15e90c88 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f855e113367 RDX: 0000000000000000 RSI: 00007ffd15e90cf0 RDI: 0000000000000004 RBP: 00007ffd15e90dbc R08: 0000000000000028 R09: 000000000045d100 R10: 00007f855e011dd8 R11: 0000000000000246 R12: 0000000000000019 R13: 0000000067c6b785 R14: 00000000004a1e80 R15: 0000000000000000 </TASK> Modules linked in: 8021q garp mrp sch_ingress openvswitch nsh mlx5_ib mlx5_fwctl mlx5_dpll mlx5_core rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay zram zsmalloc fuse [last unloaded: mlx5_core] ---[ end trace 0000000000000000 ]--- Fixes: 8ff0ac5be144 ("net/mlx5: Add MACsec offload Tx command support") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1746958552-561295-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-24net/mlx5e: Use right API to free bitmap memoryMark Zhang
Use bitmap_free() to free memory allocated with bitmap_zalloc_node(). This fixes memtrack error: mtl rsc inconsistency: memtrack_free: .../drivers/net/ethernet/mellanox/mlx5/core/en_main.c::466: kfree for unknown address=0xFFFF0000CA3619E8, device=0x0 Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/1742412199-159596-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.14-rc6). Conflicts: tools/testing/selftests/drivers/net/ping.py 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py") de94e8697405 ("selftests: drv-net: store addresses in dict indexed by ipver") https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/ net/core/devmem.c a70f891e0fa0 ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()") 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations") https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/ Adjacent changes: tools/testing/selftests/net/Makefile 6f50175ccad4 ("selftests: Add IPv6 link-local address generation tests for GRE devices.") 2e5584e0f913 ("selftests/net: expand cmsg_ipv6.sh with ipv4") drivers/net/ethernet/broadcom/bnxt/bnxt.c 661958552eda ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic") fe96d717d38e ("bnxt_en: Extend queue stop/start for TX rings") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13net/mlx5e: Prevent bridge link show failure for non-eswitch-allowed devicesCarolina Jubran
mlx5_eswitch_get_vepa returns -EPERM if the device lacks eswitch_manager capability, blocking mlx5e_bridge_getlink from retrieving VEPA mode. Since mlx5e_bridge_getlink implements ndo_bridge_getlink, returning -EPERM causes bridge link show to fail instead of skipping devices without this capability. To avoid this, return -EOPNOTSUPP from mlx5e_bridge_getlink when mlx5_eswitch_get_vepa fails, ensuring the command continues processing other devices while ignoring those without the necessary capability. Fixes: 4b89251de024 ("net/mlx5: Support ndo bridge_setlink and getlink") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1741644104-97767-7-git-send-email-tariqt@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-04net: rename netns_local to netns_immutableNicolas Dichtel
The name 'netns_local' is confusing. A following commit will export it via netlink, so let's use a more explicit name. Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27net/mlx5e: Avoid a hundred -Wflex-array-member-not-at-end warningsGustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. So, in this particular case, we create a new `struct mlx5e_umr_wqe_hdr` to enclose the header part of flexible structure `struct mlx5e_umr_wqe`. This is, all the members except the flexible arrays `inline_mtts`, `inline_klms` and `inline_ksms` in the anonymous union. We then replace the header part with `struct mlx5e_umr_wqe_hdr hdr;` in `struct mlx5e_umr_wqe`, and change the type of the object currently causing trouble `umr_wqe` from `struct mlx5e_umr_wqe` to `struct mlx5e_umr_wqe_hdr` --this last bit gets rid of the flex-array-in-the-middle part and avoid the warnings. Also, no new members should be added to `struct mlx5e_umr_wqe`, instead any new members must be included in the header structure `struct mlx5e_umr_wqe_hdr`. To enforce this, we use `static_assert()`, ensuring that the memory layout of both the flexible structure and the newly created header struct remain consistent. The next step is to refactor the rest of the related code accordingly, which means adding a bunch of `hdr.` wherever needed. Lastly, we use `container_of()` whenever we need to retrieve a pointer to the flexible structure `struct mlx5e_umr_wqe`. So, with these changes, fix 125 of the following warnings: drivers/net/ethernet/mellanox/mlx5/core/en.h:664:48: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://patch.msgid.link/Z76HzPW1dFTLOSSy@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12net/mlx5: XDP, Enable TX side XDP multi-buffer supportAlexei Lazar
In XDP scenarios, fragmented packets can occur if the MTU is larger than the page size, even when the packet size fits within the linear part. If XDP multi-buffer support is disabled, the fragmented part won't be handled in the TX flow, leading to packet drops. Since XDP multi-buffer support is always available, this commit removes the conditional check for enabling it. This ensures that XDP multi-buffer support is always enabled, regardless of the `is_xdp_mb` parameter, and guarantees the handling of fragmented packets in such scenarios. Signed-off-by: Alexei Lazar <alazar@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250209101716.112774-16-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06net/mlx5e: Avoid WARN_ON when configuring MQPRIO with HTB offload enabledCarolina Jubran
When attempting to enable MQPRIO while HTB offload is already configured, the driver currently returns `-EINVAL` and triggers a `WARN_ON`, leading to an unnecessary call trace. Update the code to handle this case more gracefully by returning `-EOPNOTSUPP` instead, while also providing a helpful user message. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-06net/mlx5: Change clock in mlx5_core_dev to mlx5_clock pointerJianbo Liu
Change clock member in mlx5_core_dev to a pointer, so it can point to a clock shared by multiple functions in later patch. For now, each function has its own clock, so mdev in mlx5_clock_priv is the back pointer to the function. Later it points to one (normally the first one) of the multiple functions sharing the same clock. Change mlx5_init_clock() to return error if mlx5_clock is not allocated. Besides, a null clock is defined and used when hardware clock is not supported. So, the clock pointer is always pointing to something valid. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-27net/mlx5e: add missing cpu_to_node to kvzalloc_node in mlx5e_open_xdpredirect_sqStanislav Fomichev
kvzalloc_node is not doing a runtime check on the node argument (__alloc_pages_node_noprof does have a VM_BUG_ON, but it expands to nothing on !CONFIG_DEBUG_VM builds), so doing any ethtool/netlink operation that calls mlx5e_open on a CPU that's larger that MAX_NUMNODES triggers OOB access and panic (see the trace below). Add missing cpu_to_node call to convert cpu id to node id. [ 165.427394] mlx5_core 0000:5c:00.0 beth1: Link up [ 166.479327] BUG: unable to handle page fault for address: 0000000800000010 [ 166.494592] #PF: supervisor read access in kernel mode [ 166.505995] #PF: error_code(0x0000) - not-present page ... [ 166.816958] Call Trace: [ 166.822380] <TASK> [ 166.827034] ? __die_body+0x64/0xb0 [ 166.834774] ? page_fault_oops+0x2cd/0x3f0 [ 166.843862] ? exc_page_fault+0x63/0x130 [ 166.852564] ? asm_exc_page_fault+0x22/0x30 [ 166.861843] ? __kvmalloc_node_noprof+0x43/0xd0 [ 166.871897] ? get_partial_node+0x1c/0x320 [ 166.880983] ? deactivate_slab+0x269/0x2b0 [ 166.890069] ___slab_alloc+0x521/0xa90 [ 166.898389] ? __kvmalloc_node_noprof+0x43/0xd0 [ 166.908442] __kmalloc_node_noprof+0x216/0x3f0 [ 166.918302] ? __kvmalloc_node_noprof+0x43/0xd0 [ 166.928354] __kvmalloc_node_noprof+0x43/0xd0 [ 166.938021] mlx5e_open_channels+0x5e2/0xc00 [ 166.947496] mlx5e_open_locked+0x3e/0xf0 [ 166.956201] mlx5e_open+0x23/0x50 [ 166.963551] __dev_open+0x114/0x1c0 [ 166.971292] __dev_change_flags+0xa2/0x1b0 [ 166.980378] dev_change_flags+0x21/0x60 [ 166.988887] do_setlink+0x38d/0xf20 [ 166.996628] ? ep_poll_callback+0x1b9/0x240 [ 167.005910] ? __nla_validate_parse.llvm.10713395753544950386+0x80/0xd70 [ 167.020782] ? __wake_up_sync_key+0x52/0x80 [ 167.030066] ? __mutex_lock+0xff/0x550 [ 167.038382] ? security_capable+0x50/0x90 [ 167.047279] rtnl_setlink+0x1c9/0x210 [ 167.055403] ? ep_poll_callback+0x1b9/0x240 [ 167.064684] ? security_capable+0x50/0x90 [ 167.073579] rtnetlink_rcv_msg+0x2f9/0x310 [ 167.082667] ? rtnetlink_bind+0x30/0x30 [ 167.091173] netlink_rcv_skb+0xb1/0xe0 [ 167.099492] netlink_unicast+0x20f/0x2e0 [ 167.108191] netlink_sendmsg+0x389/0x420 [ 167.116896] __sys_sendto+0x158/0x1c0 [ 167.125024] __x64_sys_sendto+0x22/0x30 [ 167.133534] do_syscall_64+0x63/0x130 [ 167.141657] ? __irq_exit_rcu.llvm.17843942359718260576+0x52/0xd0 [ 167.155181] entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: bb135e40129d ("net/mlx5e: move XDP_REDIRECT sq to dynamic allocation") Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250123000407.3464715-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc6). No conflicts. Adjacent changes: include/linux/if_vlan.h f91a5b808938 ("af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK") 3f330db30638 ("net: reformat kdoc return statements") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23net/mlx5e: Keep netdev when leave switchdev for devlink set legacy onlyJianbo Liu
In the cited commit, when changing from switchdev to legacy mode, uplink representor's netdev is kept, and its profile is replaced with nic profile, so netdev is detached from old profile, then attach to new profile. During profile change, the hardware resources allocated by the old profile will be cleaned up. However, the cleanup is relying on the related kernel modules. And they may need to flush themselves first, which is triggered by netdev events, for example, NETDEV_UNREGISTER. However, netdev is kept, or netdev_register is called after the cleanup, which may cause troubles because the resources are still referred by kernel modules. The same process applies to all the caes when uplink is leaving switchdev mode, including devlink eswitch mode set legacy, driver unload and devlink reload. For the first one, it can be blocked and returns failure to users, whenever possible. But it's hard for the others. Besides, the attachment to nic profile is unnecessary as the netdev will be unregistered anyway for such cases. So in this patch, the original behavior is kept only for devlink eswitch set mode legacy. For the others, moves netdev unregistration before the profile change. Fixes: 7a9fb35e8c3a ("net/mlx5e: Do not reload ethernet ports when changing eswitch mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241220081505.1286093-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-18net/mlx5e: Report rx_discards_phy via rx_droppedYafang Shao
We noticed a high number of rx_discards_phy events on certain servers while running `ethtool -S`. However, this critical counter is not currently included in the standard /proc/net/dev statistics file, making it difficult to monitor effectively—especially given the diversity of vendors across a large fleet of servers. Let's report it via the standard rx_dropped metric. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Gal Pressman <gal@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241210022706.6665-1-laoar.shao@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04net/mlx5e: SD, Use correct mdev to build channel paramTariq Toukan
In a multi-PF netdev, each traffic channel creates its own resources against a specific PF. In the cited commit, where this support was added, the channel_param logic was mistakenly kept unchanged, so it always used the primary PF which is found at priv->mdev. In this patch we fix this by moving the logic to be per-channel, and passing the correct mdev instance. This bug happened to be usually harmless, as the resulting cparam structures would be the same for all channels, due to identical FW logic and decisions. However, in some use cases, like fwreset, this gets broken. This could lead to different symptoms. Example: Error cqe on cqn 0x428, ci 0x0, qn 0x10a9, opcode 0xe, syndrome 0x4, vendor syndrome 0x32 Fixes: e4f9686bdee7 ("net/mlx5e: Let channels be SD-aware") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20241203204920.232744-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: SHAMPO, Drop info arrayDragos Tatulea
The info array is used to store a pointer to the dma address of the header and to the frag page. However, this array is not really required: - The frag page can be calculated from the header index frag page index = header index / headers per page. - The dma address can be calculated through a formula: dma page address + header offset. This series gets rid of the info array and uses the above formulas instead. The current_page_index was used in conjunction with the info array to store page fragment indices. This variable is dropped as well. There was no performance regression observed. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-12-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: SHAMPO, Fix page_index calculation inconsistencyDragos Tatulea
When calculating the index for the next frag page slot, the divisor is incorrect: it should be the number of pages per queue not the number of headers per queue. This is currently harmless because frag pages are not used directly, but they are intermediated through the info array. But it needs to be fixed as an upcoming patch will get rid of the info array. This patch introduces a new pages per queue variable and plugs it in the formula. Now that this variable exists, additional code can be simplified in the SHAMPO initialization code. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107194357.683732-10-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-11net/mlx5e: clear xdp features on non-uplink representorsWilliam Tu
Non-uplink representor port does not support XDP. The patch clears the xdp feature by checking the net_device_ops.ndo_bpf is set or not. Verify using the netlink tool: $ tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump dev-get Representor netdev before the patch: {'ifindex': 8, 'xdp-features': {'basic', 'ndo-xmit', 'ndo-xmit-sg', 'redirect', 'rx-sg', 'xsk-zerocopy'}, 'xdp-rx-metadata-features': set(), 'xdp-zc-max-segs': 1, 'xsk-features': set()}, With the patch: {'ifindex': 8, 'xdp-features': set(), 'xdp-rx-metadata-features': set(), 'xsk-features': set()}, Fixes: 4d5ab0ad964d ("net/mlx5e: take into account device reconfiguration for xdp_features flag") Signed-off-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241107183527.676877-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03net/mlx5e: do not create xdp_redirect for non-uplink repWilliam Tu
XDP and XDP socket require extra SQ/RQ/CQs. Most of these resources are dynamically created: no XDP program loaded, no resources are created. One exception is the SQ/CQ created for XDP_REDRIECT, used for other netdev to forward packet to mlx5 for transmit. The patch disables creation of SQ and CQ used for egress XDP_REDIRECT, by checking whether ndo_xdp_xmit is set or not. For netdev without XDP support such as non-uplink representor, this saves around 0.35MB of memory, per representor netdevice per channel. Signed-off-by: William Tu <witu@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241031125856.530927-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-03net/mlx5e: move XDP_REDIRECT sq to dynamic allocationWilliam Tu
Dynamically allocating xdpsq, used by egress side XDP_REDIRECT. mlx5 has multiple XDP sqs. Under struct mlx5e_channel: 1. rx_xdpsq: used for XDP_TX, an XDP prog handles the rx packet and transmits using the same queue as rx. 2. xdpsq: used by egress side XDP_REDIRECT. This is for another interface to redirect packet to the mlx5 interface, using ndo_xdp_xmit . 3. xsksq: used by XSK. XSK has its own dedicated channel, and it also has resources of 1 and 2. The patch changes only the 2. xdpsq. Signed-off-by: William Tu <witu@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241031125856.530927-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29mlx5: fix typo in "mlx5_cqwq_get_cqe_enahnced_comp"Caleb Sander Mateos
"enahnced" looks to be a misspelling of "enhanced". Rename "mlx5_cqwq_get_cqe_enahnced_comp" to "mlx5_cqwq_get_cqe_enhanced_comp". Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20241023164840.140535-1-csander@purestorage.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/mlx5e: Update features on MTU changeDragos Tatulea
When the MTU changes successfully, trigger netdev_update_features() to enable features in wanted state if applicable. An example of such scenario: $ ip link set dev eth1 up $ ethtool --set-ring eth1 rx 8192 $ ip link set dev eth1 mtu 9000 $ ethtool --features eth1 rx-gro-hw on --> fails $ ip link set dev eth1 mtu 7000 With this patch, HW GRO will be turned on automatically because it is set in the device's wanted_features. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241024164134.299646-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.12-rc4). Conflicts: 107a034d5c1e ("net/mlx5: qos: Store rate groups in a qos domain") 1da9cfd6c41c ("net/mlx5: Unregister notifier on eswitch init failure") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-17net/mlx5e: Don't call cleanup on profile rollback failureCosmin Ratiu
When profile rollback fails in mlx5e_netdev_change_profile, the netdev profile var is left set to NULL. Avoid a crash when unloading the driver by not calling profile->cleanup in such a case. This was encountered while testing, with the original trigger that the wq rescuer thread creation got interrupted (presumably due to Ctrl+C-ing modprobe), which gets converted to ENOMEM (-12) by mlx5e_priv_init, the profile rollback also fails for the same reason (signal still active) so the profile is left as NULL, leading to a crash later in _mlx5e_remove. [ 732.473932] mlx5_core 0000:08:00.1: E-Switch: Unload vfs: mode(OFFLOADS), nvfs(2), necvfs(0), active vports(2) [ 734.525513] workqueue: Failed to create a rescuer kthread for wq "mlx5e": -EINTR [ 734.557372] mlx5_core 0000:08:00.1: mlx5e_netdev_init_profile:6235:(pid 6086): mlx5e_priv_init failed, err=-12 [ 734.559187] mlx5_core 0000:08:00.1 eth3: mlx5e_netdev_change_profile: new profile init failed, -12 [ 734.560153] workqueue: Failed to create a rescuer kthread for wq "mlx5e": -EINTR [ 734.589378] mlx5_core 0000:08:00.1: mlx5e_netdev_init_profile:6235:(pid 6086): mlx5e_priv_init failed, err=-12 [ 734.591136] mlx5_core 0000:08:00.1 eth3: mlx5e_netdev_change_profile: failed to rollback to orig profile, -12 [ 745.537492] BUG: kernel NULL pointer dereference, address: 0000000000000008 [ 745.538222] #PF: supervisor read access in kernel mode <snipped> [ 745.551290] Call Trace: [ 745.551590] <TASK> [ 745.551866] ? __die+0x20/0x60 [ 745.552218] ? page_fault_oops+0x150/0x400 [ 745.555307] ? exc_page_fault+0x79/0x240 [ 745.555729] ? asm_exc_page_fault+0x22/0x30 [ 745.556166] ? mlx5e_remove+0x6b/0xb0 [mlx5_core] [ 745.556698] auxiliary_bus_remove+0x18/0x30 [ 745.557134] device_release_driver_internal+0x1df/0x240 [ 745.557654] bus_remove_device+0xd7/0x140 [ 745.558075] device_del+0x15b/0x3c0 [ 745.558456] mlx5_rescan_drivers_locked.part.0+0xb1/0x2f0 [mlx5_core] [ 745.559112] mlx5_unregister_device+0x34/0x50 [mlx5_core] [ 745.559686] mlx5_uninit_one+0x46/0xf0 [mlx5_core] [ 745.560203] remove_one+0x4e/0xd0 [mlx5_core] [ 745.560694] pci_device_remove+0x39/0xa0 [ 745.561112] device_release_driver_internal+0x1df/0x240 [ 745.561631] driver_detach+0x47/0x90 [ 745.562022] bus_remove_driver+0x84/0x100 [ 745.562444] pci_unregister_driver+0x3b/0x90 [ 745.562890] mlx5_cleanup+0xc/0x1b [mlx5_core] [ 745.563415] __x64_sys_delete_module+0x14d/0x2f0 [ 745.563886] ? kmem_cache_free+0x1b0/0x460 [ 745.564313] ? lockdep_hardirqs_on_prepare+0xe2/0x190 [ 745.564825] do_syscall_64+0x6d/0x140 [ 745.565223] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 745.565725] RIP: 0033:0x7f1579b1288b Fixes: 3ef14e463f6e ("net/mlx5e: Separate between netdev objects and mlx5e profiles initialization") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-14mlx5: Add support for persistent NAPI configJoe Damato
Use netif_napi_add_config to assign persistent per-NAPI config when initializing NAPIs. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20241011184527.16393-9-jdamato@fastly.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-12net/mlx5e: Match cleanup order in mlx5e_free_rq in reverse of mlx5e_alloc_rqRahul Rameshbabu
mlx5e_free_rq previously cleaned resources in an order that was not the reverse of the resource allocation order in mlx5e_alloc_rq. Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20240911201757.1505453-16-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-03netdev_features: convert NETIF_F_NETNS_LOCAL to dev->netns_localAlexander Lobakin
"Interface can't change network namespaces" is rather an attribute, not a feature, and it can't be changed via Ethtool. Make it a "cold" private flag instead of a netdev_feature and free one more bit. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-16net/mlx5e: XPS, Fix oversight of Multi-PF Netdev changesCarolina Jubran
The offending commit overlooked the Multi-PF Netdev changes. Revert mlx5e_set_default_xps_cpumasks to incorporate Multi-PF Netdev changes. Fixes: bcee093751f8 ("net/mlx5e: Modifying channels number and updating TX queues") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240815071611.2211873-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-16net/mlx5e: SHAMPO, Release in progress headersDragos Tatulea
The change in the fixes tag cleaned up too much: it removed the part that was releasing header pages that were posted via UMR but haven't been acknowledged yet on the ICOSQ. This patch corrects this omission by setting the bits between pi and ci to on when shutting down a queue with SHAMPO. To be consistent with the Striding RQ code, this action is done in mlx5e_free_rx_missing_descs(). Fixes: e839ac9a89cb ("net/mlx5e: SHAMPO, Simplify header page release in teardown") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240815071611.2211873-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09net/mlx5e: Fix queue stats access to non-existing channels splatGal Pressman
The queue stats API queries the queues according to the real_num_[tr]x_queues, in case the device is down and channels were not yet created, don't try to query their statistics. To trigger the panic, run this command before the interface is brought up: ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml --dump qstats-get --json '{"ifindex": 4}' BUG: kernel NULL pointer dereference, address: 0000000000000c00 PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 3 UID: 0 PID: 977 Comm: python3 Not tainted 6.10.0+ #40 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5e_get_queue_stats_rx+0x3c/0xb0 [mlx5_core] Code: fc 55 48 63 ee 53 48 89 d3 e8 40 3d 70 e1 85 c0 74 58 4c 89 ef e8 d4 07 04 00 84 c0 75 41 49 8b 84 24 f8 39 00 00 48 8b 04 e8 <48> 8b 90 00 0c 00 00 48 03 90 40 0a 00 00 48 89 53 08 48 8b 90 08 RSP: 0018:ffff888116be37d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888116be3868 RCX: 0000000000000004 RDX: ffff88810ada4000 RSI: 0000000000000000 RDI: ffff888109df09c0 RBP: 0000000000000000 R08: 0000000000000004 R09: 0000000000000004 R10: ffff88813461901c R11: ffffffffffffffff R12: ffff888109df0000 R13: ffff888109df09c0 R14: ffff888116be38d0 R15: 0000000000000000 FS: 00007f4375d5c740(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000c00 CR3: 0000000106ada006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die+0x1f/0x60 ? page_fault_oops+0x14e/0x3d0 ? exc_page_fault+0x73/0x130 ? asm_exc_page_fault+0x22/0x30 ? mlx5e_get_queue_stats_rx+0x3c/0xb0 [mlx5_core] netdev_nl_stats_by_netdev+0x2a6/0x4c0 ? __rmqueue_pcplist+0x351/0x6f0 netdev_nl_qstats_get_dumpit+0xc4/0x1b0 genl_dumpit+0x2d/0x80 netlink_dump+0x199/0x410 __netlink_dump_start+0x1aa/0x2c0 genl_family_rcv_msg_dumpit+0x94/0xf0 ? __pfx_genl_start+0x10/0x10 ? __pfx_genl_dumpit+0x10/0x10 ? __pfx_genl_done+0x10/0x10 genl_rcv_msg+0x116/0x2b0 ? __pfx_netdev_nl_qstats_get_dumpit+0x10/0x10 ? __pfx_genl_rcv_msg+0x10/0x10 netlink_rcv_skb+0x54/0x100 genl_rcv+0x24/0x40 netlink_unicast+0x21a/0x340 netlink_sendmsg+0x1f4/0x440 __sys_sendto+0x1b6/0x1c0 ? do_sock_setsockopt+0xc3/0x180 ? __sys_setsockopt+0x60/0xb0 __x64_sys_sendto+0x20/0x30 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f43757132b0 Code: c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 1d 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 68 c3 0f 1f 80 00 00 00 00 41 54 48 83 ec 20 RSP: 002b:00007ffd258da048 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007ffd258da0f8 RCX: 00007f43757132b0 RDX: 000000000000001c RSI: 00007f437464b850 RDI: 0000000000000003 RBP: 00007f4375085de0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f43751a6147 </TASK> Modules linked in: netconsole xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core zram zsmalloc mlx5_core fuse [last unloaded: netconsole] CR2: 0000000000000c00 ---[ end trace 0000000000000000 ]--- RIP: 0010:mlx5e_get_queue_stats_rx+0x3c/0xb0 [mlx5_core] Code: fc 55 48 63 ee 53 48 89 d3 e8 40 3d 70 e1 85 c0 74 58 4c 89 ef e8 d4 07 04 00 84 c0 75 41 49 8b 84 24 f8 39 00 00 48 8b 04 e8 <48> 8b 90 00 0c 00 00 48 03 90 40 0a 00 00 48 89 53 08 48 8b 90 08 RSP: 0018:ffff888116be37d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888116be3868 RCX: 0000000000000004 RDX: ffff88810ada4000 RSI: 0000000000000000 RDI: ffff888109df09c0 RBP: 0000000000000000 R08: 0000000000000004 R09: 0000000000000004 R10: ffff88813461901c R11: ffffffffffffffff R12: ffff888109df0000 R13: ffff888109df09c0 R14: ffff888116be38d0 R15: 0000000000000000 FS: 00007f4375d5c740(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000c00 CR3: 0000000106ada006 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 7b66ae536a78 ("net/mlx5e: Add per queue netdev-genl stats") Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20240808144107.2095424-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-09net/mlx5e: SHAMPO, Increase timeout to improve latencyDragos Tatulea
During latency tests (netperf TCP_RR) a 30% degradation of HW GRO vs SW GRO was observed. This is due to SHAMPO triggering timeout filler CQEs instead of delivering the CQE for the packet. Having a short timeout for SHAMPO doesn't bring any benefits as it is the driver that does the merging, not the hardware. On the contrary, it can have a negative impact: additional filler CQEs are generated due to the timeout. As there is no way to disable this timeout, this change sets it to the maximum value. Instead of using the packet_merge.timeout parameter which is also used for LRO, set the value directly when filling in the rest of the SHAMPO parameters in mlx5e_build_rq_param(). Fixes: 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20240808144107.2095424-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-13eth: mlx5: expose NETIF_F_NTUPLE when ARFS is compiled outJakub Kicinski
ARFS depends on NTUPLE filters, but the inverse is not true. Drivers which don't support ARFS commonly still support NTUPLE filtering. mlx5 has a Kconfig option to disable ARFS (MLX5_EN_ARFS) and does not advertise NTUPLE filters as a feature at all when ARFS is compiled out. That's not correct, ntuple filters indeed still work just fine (as long as MLX5_EN_RXNFC is enabled). This is needed to make the RSS test not skip all RSS context related testing. Acked-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://patch.msgid.link/20240711223722.297676-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h 219343755eae ("net: phy: aquantia: add missing include guards") 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx") b501d261a5b3 ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/ include/linux/mlx5/mlx5_ifc.h 048a403648fc ("net/mlx5: IFC updates for changing max EQs") 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c 4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") 3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h 816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") 5a009b42e041 ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-28net/mlx5e: Add mqprio_rl cleanup and free in mlx5e_priv_cleanup()Jianbo Liu
In the cited commit, mqprio_rl cleanup and free are mistakenly removed in mlx5e_priv_cleanup(), and it causes the leakage of host memory and firmware SCHEDULING_ELEMENT objects while changing eswitch mode. So, add them back. Fixes: 0bb7228f7096 ("net/mlx5e: Fix mqprio_rl handling on devlink reload") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17net/mlx5e: Add per queue netdev-genl statsJoe Damato
./cli.py --spec netlink/specs/netdev.yaml \ --dump qstats-get --json '{"scope": "queue"}' ...snip {'ifindex': 7, 'queue-id': 62, 'queue-type': 'rx', 'rx-alloc-fail': 0, 'rx-bytes': 105965251, 'rx-packets': 179790}, {'ifindex': 7, 'queue-id': 0, 'queue-type': 'tx', 'tx-bytes': 9402665, 'tx-packets': 17551}, ...snip Also tested with the script tools/testing/selftests/drivers/net/stats.py in several scenarios to ensure stats tallying was correct: - on boot (default queue counts) - adjusting queue count up or down (ethtool -L eth0 combined ...) The tools/testing/selftests/drivers/net/stats.py brings the device up, so to test with the device down, I did the following: $ ip link show eth4 7: eth4: <BROADCAST,MULTICAST> mtu 9000 qdisc mq state DOWN [..snip..] [..snip..] $ cat /proc/net/dev | grep eth4 eth4: 235710489 434811 [..snip rx..] 2878744 21227 [..snip tx..] $ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \ --dump qstats-get --json '{"ifindex": 7}' [{'ifindex': 7, 'rx-alloc-fail': 0, 'rx-bytes': 235710489, 'rx-packets': 434811, 'tx-bytes': 2878744, 'tx-packets': 21227}] Compare the values in /proc/net/dev match the output of cli for the same device, even while the device is down. Note that while the device is down, per queue stats output nothing (because the device is down there are no queues): $ ./cli.py --spec ../../../Documentation/netlink/specs/netdev.yaml \ --dump qstats-get --json '{"scope": "queue", "ifindex": 7}' [] Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-17net/mlx5e: Add txq to sq stats mappingJoe Damato
mlx5 currently maps txqs to an sq via priv->txq2sq. It is useful to map txqs to sq_stats, as well, for direct access to stats. Add priv->txq2sq_stats and insert mappings. The mappings will be used next to tabulate stats information. Signed-off-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-14net/mlx5e: Fix outdated comment in features checkGal Pressman
The code no longer treats only UDP tunnels, adjust the outdated comment. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240613210036.1125203-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts, no adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-10net/mlx5e: Fix features validation check for tunneled UDP (non-VXLAN) packetsGal Pressman
Move the vxlan_features_check() call to after we verified the packet is a tunneled VXLAN packet. Without this, tunneled UDP non-VXLAN packets (for ex. GENENVE) might wrongly not get offloaded. In some cases, it worked by chance as GENEVE header is the same size as VXLAN, but it is obviously incorrect. Fixes: e3cfc7e6b7bd ("net/mlx5e: TX, Add geneve tunnel stateless offload support") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05net/mlx5e: SHAMPO, Re-enable HW-GROYoray Zack
Add back HW-GRO to the reported features. As the current implementation of HW-GRO uses KSMs with a specific fixed buffer size (256B) to map its headers buffer, we reported the feature only if the NIC is supporting KSM and the minimum value for buffer size is below the requested one. iperf3 bandwidth comparison: +---------+--------+--------+-----------+ | streams | SW GRO | HW GRO | Unit | |---------+--------+--------+-----------| | 1 | 36 | 42 | Gbits/sec | | 4 | 34 | 39 | Gbits/sec | | 8 | 31 | 35 | Gbits/sec | +---------+--------+--------+-----------+ A downstream patch will add skb fragment coalescing which will improve performance considerably. Benchmark details: VM based setup CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores NIC: ConnectX-7 100GbE iperf3 and irq running on same CPU over a single receive queue Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-14-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Use KSMs instead of KLMsYoray Zack
KSM Mkey is KLM Mkey with a fixed buffer size. Due to this fact, it is a faster mechanism than KLM. SHAMPO feature used KLMs Mkeys for memory mappings of its headers buffer. As it used KLMs with the same buffer size for each entry, we can use KSMs instead. This commit changes the Mkeys that map the SHAMPO headers buffer from KLMs to KSMs. Signed-off-by: Yoray Zack <yorayz@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-13-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Simplify header page release in teardownDragos Tatulea
The function that releases SHAMPO header pages (mlx5e_shampo_dealloc_hd) has some complicated logic that comes from the fact that it is called twice during teardown: 1) To release the posted header pages that didn't get any completions. 2) To release all remaining header pages. This flow is not necessary: all header pages can be released from the driver side in one go. Furthermore, the above flow is buggy. Taking the 8 headers per page example: 1) Release fragments 5-7. Page will be released. 2) Release remaining fragments 0-4. The bits in the header will indicate that the page needs releasing. But this is incorrect: page was released in step 1. This patch releases all header pages in one go. This simplifies the header page cleanup function. For consistency, the datapath header page release API (mlx5e_free_rx_shampo_hd_entry()) is used. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05net/mlx5e: SHAMPO, Fix FCS config when HW GRO onDragos Tatulea
For the following scenario: ethtool --features eth3 rx-gro-hw on ethtool --features eth3 rx-fcs on ethtool --features eth3 rx-fcs off ... there is a firmware error because the driver enables HW GRO first while FCS is still enabled. This patch fixes this by swapping the order of HW GRO and FCS for this specific case. Take LRO into consideration as well for consistency. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240603212219.1037656-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-24net/mlx5e: Use rx_missed_errors instead of rx_dropped for reporting buffer ↵Carolina Jubran
exhaustion Previously, the driver incorrectly used rx_dropped to report device buffer exhaustion. According to the documentation, rx_dropped should not be used to count packets dropped due to buffer exhaustion, which is the purpose of rx_missed_errors. Use rx_missed_errors as intended for counting packets dropped due to buffer exhaustion. Fixes: 269e6b3af3bf ("net/mlx5e: Report additional error statistics in get stats ndo") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in late fixes to prepare for the 6.10 net-next PR. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-13net/mlx5e: Modifying channels number and updating TX queuesCarolina Jubran
It is not appropriate for the mlx5e_num_channels_changed function to be called solely for updating the TX queues, even if the channels number has not been changed. Move the code responsible for updating the TC and TX queues from mlx5e_num_channels_changed and produce a new function called mlx5e_update_tc_and_tx_queues. This new function should only be called when the channels number remains unchanged. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240512124306.740898-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-10net/mlx5e: Fix netif state handlingShay Drory
mlx5e_suspend cleans resources only if netif_device_present() returns true. However, mlx5e_resume changes the state of netif, via mlx5e_nic_enable, only if reg_state == NETREG_REGISTERED. In the below case, the above leads to NULL-ptr Oops[1] and memory leaks: mlx5e_probe _mlx5e_resume mlx5e_attach_netdev mlx5e_nic_enable <-- netdev not reg, not calling netif_device_attach() register_netdev <-- failed for some reason. ERROR_FLOW: _mlx5e_suspend <-- netif_device_present return false, resources aren't freed :( Hence, clean resources in this case as well. [1] BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0010 [#1] SMP CPU: 2 PID: 9345 Comm: test-ovs-ct-gen Not tainted 6.5.0_for_upstream_min_debug_2023_09_05_16_01 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:0x0 Code: Unable to access opcode bytes at0xffffffffffffffd6. RSP: 0018:ffff888178aaf758 EFLAGS: 00010246 Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x14c/0x3c0 ? exc_page_fault+0x75/0x140 ? asm_exc_page_fault+0x22/0x30 notifier_call_chain+0x35/0xb0 blocking_notifier_call_chain+0x3d/0x60 mlx5_blocking_notifier_call_chain+0x22/0x30 [mlx5_core] mlx5_core_uplink_netdev_event_replay+0x3e/0x60 [mlx5_core] mlx5_mdev_netdev_track+0x53/0x60 [mlx5_ib] mlx5_ib_roce_init+0xc3/0x340 [mlx5_ib] __mlx5_ib_add+0x34/0xd0 [mlx5_ib] mlx5r_probe+0xe1/0x210 [mlx5_ib] ? auxiliary_match_id+0x6a/0x90 auxiliary_bus_probe+0x38/0x80 ? driver_sysfs_add+0x51/0x80 really_probe+0xc9/0x3e0 ? driver_probe_device+0x90/0x90 __driver_probe_device+0x80/0x160 driver_probe_device+0x1e/0x90 __device_attach_driver+0x7d/0x100 bus_for_each_drv+0x80/0xd0 __device_attach+0xbc/0x1f0 bus_probe_device+0x86/0xa0 device_add+0x637/0x840 __auxiliary_device_add+0x3b/0xa0 add_adev+0xc9/0x140 [mlx5_core] mlx5_rescan_drivers_locked+0x22a/0x310 [mlx5_core] mlx5_register_device+0x53/0xa0 [mlx5_core] mlx5_init_one_devl_locked+0x5c4/0x9c0 [mlx5_core] mlx5_init_one+0x3b/0x60 [mlx5_core] probe_one+0x44c/0x730 [mlx5_core] local_pci_probe+0x3e/0x90 pci_device_probe+0xbf/0x210 ? kernfs_create_link+0x5d/0xa0 ? sysfs_do_create_link_sd+0x60/0xc0 really_probe+0xc9/0x3e0 ? driver_probe_device+0x90/0x90 __driver_probe_device+0x80/0x160 driver_probe_device+0x1e/0x90 __device_attach_driver+0x7d/0x100 bus_for_each_drv+0x80/0xd0 __device_attach+0xbc/0x1f0 pci_bus_add_device+0x54/0x80 pci_iov_add_virtfn+0x2e6/0x320 sriov_enable+0x208/0x420 mlx5_core_sriov_configure+0x9e/0x200 [mlx5_core] sriov_numvfs_store+0xae/0x1a0 kernfs_fop_write_iter+0x10c/0x1a0 vfs_write+0x291/0x3c0 ksys_write+0x5f/0xe0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x46/0xb0 CR2: 0000000000000000 ---[ end trace 0000000000000000 ]--- Fixes: 2c3b5beec46a ("net/mlx5e: More generic netdev management API") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240509112951.590184-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-07net: annotate writes on dev->mtu from ndo_change_mtu()Eric Dumazet
Simon reported that ndo_change_mtu() methods were never updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted in commit 501a90c94510 ("inet: protect against too small mtu values.") We read dev->mtu without holding RTNL in many places, with READ_ONCE() annotations. It is time to take care of ndo_change_mtu() methods to use corresponding WRITE_ONCE() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/ Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-22net/mlx5e: Support updating coalescing configuration without resetting channelsRahul Rameshbabu
When CQE mode or DIM state is changed, gracefully reconfigure channels to handle new configuration. Previously, would create new channels that would reflect the changes rather than update the original channels. Co-developed-by: Nabil S. Alramli <dev@nalramli.com> Signed-off-by: Nabil S. Alramli <dev@nalramli.com> Co-developed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20240419080445.417574-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>