path: root/drivers/net/ethernet/mellanox/mlxsw
Age    Commit message    Author
2024-07-31mlxsw: core_thermal: Remove unused argumentsIdo Schimmel
'dev' and 'core' arguments are not used by mlxsw_thermal_module_init(). Remove them. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/563fc7383f61809a306b9954872219eaaf3c689b.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-31mlxsw: core_thermal: Fold two loops into oneIdo Schimmel
There is no need to traverse the same array twice. Do it once by folding both loops into one. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/81756744ed532aaa9249a83fc08757accfe8b07c.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-31mlxsw: core_thermal: Remove another unnecessary checkIdo Schimmel
mlxsw_thermal_modules_init() allocates an array of modules and then initializes each entry by calling mlxsw_thermal_module_init() which among other things initializes the 'parent' pointer of the entry. mlxsw_thermal_modules_init() then traverses over the array again, but skips over entries that do not have their 'parent' pointer set which is impossible given the above. Therefore, remove the unnecessary check. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/fb3e8ded422a441436431d5785b900f11ffc9621.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-31mlxsw: core_thermal: Remove unnecessary checkIdo Schimmel
mlxsw_thermal_modules_init() allocates an array of modules and then calls mlxsw_thermal_module_init() to initialize each entry in the array. It is therefore impossible for mlxsw_thermal_module_init() to encounter an entry that is already initialized and has its 'parent' pointer set. Remove the unnecessary check. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-31mlxsw: core_thermal: Call thermal_zone_device_unregister() unconditionallyIdo Schimmel
The function returns immediately if the thermal zone pointer is NULL so there is no need to check it before calling the function. Remove the check. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Link: https://patch.msgid.link/0bd251aa8ce03d3c951983aa6b4300d8205b88a7.1722345311.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
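A minimal sketch of the resulting cleanup (the 'tzdev' field name is assumed for illustration); since thermal_zone_device_unregister() returns immediately on a NULL pointer, the caller-side check adds nothing:

    /* Before: redundant NULL check around the unregister call */
    if (module_tz->tzdev)
            thermal_zone_device_unregister(module_tz->tzdev);

    /* After: the function copes with NULL by itself */
    thermal_zone_device_unregister(module_tz->tzdev);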
2024-07-15Merge branch 'net-make-timestamping-selectable'Jakub Kicinski
First part of "net: Make timestamping selectable" from Kory Maincent. Change the driver-facing type now, to lower rebasing pain later. Link: https://lore.kernel.org/20240709-feature_ptp_netnext-v17-0-b5317f50df2a@bootlin.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15net: Add struct kernel_ethtool_ts_infoKory Maincent
In preparation for adding new hwtstamp UAPI, note that we are currently limited to struct ethtool_ts_info, which is passed in a fixed binary format through the ETHTOOL_GET_TS_INFO ethtool ioctl. It would be good if new kernel code already started operating on an extensible kernel variant of that structure, similar in concept to struct kernel_hwtstamp_config vs struct hwtstamp_config. Since struct ethtool_ts_info is in include/uapi/linux/ethtool.h, here we introduce the kernel-only structure in include/linux/ethtool.h. The manual copy is then made in the function called by ETHTOOL_GET_TS_INFO. Acked-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20240709-feature_ptp_netnext-v17-6-b5317f50df2a@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
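A sketch of the kernel-only mirror structure as the commit describes it; the field list below is reconstructed from the UAPI struct and may differ in detail from include/linux/ethtool.h:

    /* Kernel-internal, extensible variant of struct ethtool_ts_info. */
    struct kernel_ethtool_ts_info {
            u32 cmd;
            u32 so_timestamping;              /* SOF_TIMESTAMPING_* flags */
            int phc_index;                    /* PTP clock index, -1 if none */
            enum hwtstamp_tx_types tx_types;
            enum hwtstamp_rx_filters rx_filters;
    };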
2024-07-09mlxsw: pci: Lock configuration space of upstream bridge during resetIdo Schimmel
The driver triggers a "Secondary Bus Reset" (SBR) by calling __pci_reset_function_locked() which asserts the SBR bit in the "Bridge Control Register" in the configuration space of the upstream bridge for 2ms. This is done without locking the configuration space of the upstream bridge port, allowing user space to access it concurrently. Linux 6.11 will start warning about such unlocked resets [1][2]: pcieport 0000:00:01.0: unlocked secondary bus reset via: pci_reset_bus_function+0x51c/0x6a0 Avoid the warning and the concurrent access by locking the configuration space of the upstream bridge prior to the reset and unlocking it afterwards. [1] https://lore.kernel.org/all/171711746953.1628941.4692125082286867825.stgit@dwillia2-xfh.jf.intel.com/ [2] https://lore.kernel.org/all/20240531213150.GA610983@bhelgaas/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/9937b0afdb50f2f2825945393c94c093c04a5897.1720447210.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
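A hedged sketch of the fix: take the configuration-space lock of the upstream bridge around the reset, so user space cannot race with the SBR bit manipulation:

    struct pci_dev *bridge = pci_upstream_bridge(pdev);

    if (bridge)
            pci_cfg_access_lock(bridge);
    err = __pci_reset_function_locked(pdev);    /* asserts SBR on the bridge */
    if (bridge)
            pci_cfg_access_unlock(bridge);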
2024-07-09mlxsw: core_thermal: Report valid current state during cooling device registrationIdo Schimmel
Commit 31a0fa0019b0 ("thermal/debugfs: Pass cooling device state to thermal_debug_cdev_add()") changed the thermal core to read the current state of the cooling device as part of the cooling device's registration. This is incompatible with the current implementation of the cooling device operations in mlxsw, leading to initialization failure with errors such as: mlxsw_spectrum 0000:01:00.0: Failed to register cooling device mlxsw_spectrum 0000:01:00.0: cannot register bus device The reason for the failure is that when the get current state operation is invoked the driver tries to derive the index of the cooling device by walking a per thermal zone array and looking for the matching cooling device pointer. However, the pointer is returned from the registration function and therefore only set in the array after the registration. The issue was later fixed by commit 1af89dedc8a5 ("thermal: core: Do not fail cdev registration because of invalid initial state") by not failing the registration of the cooling device if it cannot report a valid current state during registration, although drivers are responsible for ensuring that this will not happen. Therefore, make sure the driver is able to report a valid current state for the cooling device during registration by passing to the registration function a per cooling device private data that already has the cooling device index populated. While at it, call thermal_cooling_device_unregister() unconditionally since the function returns immediately if the cooling device pointer is NULL. Reviewed-by: Vadim Pasternak <vadimp@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/c823c4678b6b7afb902c35b3551c81a053afd110.1720447210.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
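A sketch of the shape of the fix (struct and field names assumed): each cooling device gets private data that carries its index from the start, so the get-current-state callback no longer depends on the pointer that registration has yet to return:

    struct mlxsw_thermal_cdev {
            struct mlxsw_thermal *thermal;
            int idx;                        /* known before registration */
    };

    cdev = thermal_cooling_device_register(name, mlxsw_cdev,
                                           &mlxsw_cooling_ops);
    if (IS_ERR(cdev))
            return PTR_ERR(cdev);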
2024-07-09mlxsw: Warn about invalid accesses to array fieldsPetr Machata
A forgotten or buggy variable initialization can cause out-of-bounds access to a register or other item array field. For an overflow, such access would mangle adjacent parts of the register payload. For an underflow, due to all variables being unsigned, the access would likely trample unrelated memory. Since neither is correct, replace these accesses with accesses at the index of 0, and warn about the issue. Suggested-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/b988fb265c2f6c1206fe12d5bfdcfa188b7672d1.1720447210.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
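An illustrative version of the hardening pattern (not the exact mlxsw code): an out-of-range index into an item array field is redirected to slot 0 with a warning, instead of mangling adjacent payload or unrelated memory:

    if (WARN_ON(index >= item_array_size))  /* forgotten/buggy initialization */
            index = 0;                      /* access a safe slot instead */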
2024-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/aquantia/aquantia.h 219343755eae ("net: phy: aquantia: add missing include guards") 61578f679378 ("net: phy: aquantia: add support for PHY LEDs") drivers/net/ethernet/wangxun/libwx/wx_hw.c bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx") b501d261a5b3 ("net: txgbe: add FDIR ATR support") https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/ include/linux/mlx5/mlx5_ifc.h 048a403648fc ("net/mlx5: IFC updates for changing max EQs") 99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO") https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/ Adjacent changes: drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c 4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference") 3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard") include/net/mac80211.h 816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP") 5a009b42e041 ("wifi: mac80211: track changes in AP's TPE") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04mlxsw: core_linecards: Fix double memory deallocation in case of invalid INI fileAleksandr Mishin
In case of an invalid INI file, mlxsw_linecard_types_init() deallocates memory but doesn't reset the pointer to NULL, and returns 0. In case of any error occurring after the mlxsw_linecard_types_init() call, mlxsw_linecards_init() calls mlxsw_linecard_types_fini(), which performs the memory deallocation again. Add a pointer reset to NULL. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: b217127e5e4e ("mlxsw: core_linecards: Add line card objects and implement provisioning") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/20240703203251.8871-1-amishin@t-argos.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
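The pattern of the fix, with hypothetical field names; resetting the pointer turns the second deallocation in the fini path into a no-op, since freeing NULL is harmless:

    kfree(types_info->data);
    types_info->data = NULL;    /* fini path may free again; now a no-op */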
2024-06-28mlxsw: Implement ethtool operation to write to a transceiver module EEPROMIdo Schimmel
Implement the ethtool_ops::set_module_eeprom_by_page operation to allow ethtool to write to a transceiver module EEPROM, in a similar fashion to the ethtool_ops::get_module_eeprom_by_page operation. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: e3f02f32a050 ("ionic: fix kernel panic due to multi-buffer handling") d9c04209990b ("ionic: Mark error paths in the data path as unlikely") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-26mlxsw: pci: Use fragmented buffersAmit Cohen
WQE (Work Queue Element) includes 3 scatter/gather entries for buffers. The buffer can be split into 3 parts; software should set the address and byte count of each part. A previous patch-set used page pool to allocate buffers; to simplify the change, we first used one continuous buffer, which was allocated with order > 0. This patch improves the page pool usage to allocate the exact number of pages which are required for the packet. As part of init, fill WQE.address[x] and WQE.byte_count* with pages which are allocated from the pool. Fill x entries according to the number of scatter/gather entries which are required for the maximum packet size. When a packet is received, check the actual size and replace only the used pages. Reserve room for software overhead only as part of the first entry. This change also requires using a fragmented SKB; until now all the buffer was in the linear part. Note that 'skb->truesize' is decreased for small packets. For now the maximum buffer size is 3 * PAGE_SIZE, which is enough; in case the driver supports a larger MTU in the future, we can use 'order' to allocate more than one page per scatter/gather entry. This change significantly reduces the memory consumption of the mlxsw driver. The footprint is reduced by 26%. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/ee38898c692e7f644a7f3ea4d33aeddb4dd917d2.1719321422.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
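A hedged sketch of the receive side this describes (helper and variable names assumed): the first filled page becomes the SKB head, including the software headroom, and the remaining used pages are attached as fragments:

    /* First scatter/gather entry: headroom plus start of the packet. */
    skb = napi_build_skb(page_address(pages[0]), PAGE_SIZE);
    skb_reserve(skb, headroom);
    skb_put(skb, min(byte_count, (u32)(PAGE_SIZE - headroom)));
    byte_count -= skb->len;

    /* Remaining used entries become page fragments of the same SKB. */
    for (i = 1; byte_count; i++) {
            u32 frag_len = min(byte_count, (u32)PAGE_SIZE);

            skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, pages[i], 0,
                            frag_len, PAGE_SIZE);
            byte_count -= frag_len;
    }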
2024-06-26mlxsw: pci: Store number of scatter/gather entries for maximum packet sizeAmit Cohen
A previous patch-set used page pool for Rx buffer allocations. To simplify the change, we first used page pool for one allocation per packet - one continuous buffer is allocated for each packet. This can be improved by using fragmented buffers; memory consumption will then be significantly reduced. WQE (Work Queue Element) includes up to 3 scatter/gather entries for data. As preparation for fragmented buffer usage, calculate the number of scatter/gather entries which are required for a packet according to the maximum MTU, and store it for future use. For now use PAGE_SIZE for each entry, which means that the maximum buffer size is 3 * PAGE_SIZE. This is enough for the maximum MTU which is supported in the driver now (10K). Warn in the unlikely case of a maximum MTU which requires more than 3 pages; for now this warning should not happen with a standard page size (>=4K) and the maximum MTU (10K). Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/98c3e3adb7e727e571ac538faf67cef262cec4fc.1719321422.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
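The stored calculation, sketched (macro and variable names assumed): with 4K pages, the maximum Rx length (headroom plus 10K MTU overheads) fits in the three available entries:

    #define MLXSW_PCI_WQE_SG_ENTRIES 3      /* per-WQE hardware limit */

    num_sg_entries = DIV_ROUND_UP(max_rx_len, PAGE_SIZE);
    WARN_ON(num_sg_entries > MLXSW_PCI_WQE_SG_ENTRIES);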
2024-06-26net: Drop explicit initialization of struct i2c_device_id::driver_data to 0Uwe Kleine-König
These drivers don't use the driver_data member of struct i2c_device_id, so don't explicitly initialize this member. This prepares putting driver_data in an anonymous union, which requires either no initialization or named designators. But it's also a nice cleanup on its own. While at it, also remove commas after the sentinel entries. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Reviewed-by: Petr Machata <petrm@nvidia.com> # For mlxsw Reviewed-by: Kory Maincent <Kory.maincent@bootlin.com> Reviewed-by: Jeremy Kerr <jk@codeconstruct.com.au> # for mctp-i2c Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20240625083853.2205977-2-u.kleine-koenig@baylibre.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
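The shape of the cleanup for mlxsw's I2C driver (entry contents illustrative):

    static const struct i2c_device_id mlxsw_m_i2c_id[] = {
            { "mlxsw_minimal" },    /* no explicit driver_data = 0 */
            { }                     /* sentinel, without trailing comma */
    };
    MODULE_DEVICE_TABLE(i2c, mlxsw_m_i2c_id);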
2024-06-21mlxsw: spectrum_buffers: Fix memory corruptions on Spectrum-4 systemsIdo Schimmel
The following two shared buffer operations make use of the Shared Buffer Status Register (SBSR): # devlink sb occupancy snapshot pci/0000:01:00.0 # devlink sb occupancy clearmax pci/0000:01:00.0 The register has two masks of 256 bits to denote which ingress / egress ports the register should operate on. Spectrum-4 has more than 256 ports, so the register was extended by the cited commit with a new 'port_page' field. However, when filling the register's payload, the driver specifies the ports as absolute numbers and not relative to the first port of the port page, resulting in memory corruptions [1]. Fix by specifying the ports relative to the first port of the port page. [1] BUG: KASAN: slab-use-after-free in mlxsw_sp_sb_occ_snapshot+0xb6d/0xbc0 Read of size 1 at addr ffff8881068cb00f by task devlink/1566 [...] Call Trace: <TASK> dump_stack_lvl+0xc6/0x120 print_report+0xce/0x670 kasan_report+0xd7/0x110 mlxsw_sp_sb_occ_snapshot+0xb6d/0xbc0 mlxsw_devlink_sb_occ_snapshot+0x75/0xb0 devlink_nl_sb_occ_snapshot_doit+0x1f9/0x2a0 genl_family_rcv_msg_doit+0x20c/0x300 genl_rcv_msg+0x567/0x800 netlink_rcv_skb+0x170/0x450 genl_rcv+0x2d/0x40 netlink_unicast+0x547/0x830 netlink_sendmsg+0x8d4/0xdb0 __sys_sendto+0x49b/0x510 __x64_sys_sendto+0xe5/0x1c0 do_syscall_64+0xc1/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f [...] Allocated by task 1: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 __kasan_kmalloc+0x8f/0xa0 copy_verifier_state+0xbc2/0xfb0 do_check_common+0x2c51/0xc7e0 bpf_check+0x5107/0x9960 bpf_prog_load+0xf0e/0x2690 __sys_bpf+0x1a61/0x49d0 __x64_sys_bpf+0x7d/0xc0 do_syscall_64+0xc1/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 1: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 poison_slab_object+0x109/0x170 __kasan_slab_free+0x14/0x30 kfree+0xca/0x2b0 free_verifier_state+0xce/0x270 do_check_common+0x4828/0xc7e0 bpf_check+0x5107/0x9960 bpf_prog_load+0xf0e/0x2690 __sys_bpf+0x1a61/0x49d0 __x64_sys_bpf+0x7d/0xc0 do_syscall_64+0xc1/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: f8538aec88b4 ("mlxsw: Add support for more than 256 ports in SBSR register") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
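A sketch of the corrected numbering (helper names hypothetical): with 256 ports per page, e.g. local port 300 is written as bit 44 of port page 1, rather than as an absolute bit number that overruns the 256-bit mask:

    #define MLXSW_REG_SBSR_PORTS_PER_PAGE 256

    port_page = local_port / MLXSW_REG_SBSR_PORTS_PER_PAGE;
    rel_port  = local_port % MLXSW_REG_SBSR_PORTS_PER_PAGE;  /* 300 -> 1, 44 */
    mlxsw_reg_sbsr_port_page_set(sbsr_pl, port_page);
    mlxsw_reg_sbsr_ingress_port_mask_set(sbsr_pl, rel_port, 1);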
2024-06-21mlxsw: pci: Fix driver initialization with Spectrum-4Ido Schimmel
Cited commit added support for a new reset flow ("all reset") which is deeper than the existing reset flow ("software reset") and allows the device's PCI firmware to be upgraded. In the new flow the driver first tells the firmware that "all reset" is required by issuing a new reset command (i.e., MRSR.command=6) and then triggers the reset by having the PCI core issue a secondary bus reset (SBR). However, due to a race condition in the device's firmware the device is not always able to recover from this reset, resulting in initialization failures [1]. New firmware versions include a fix for the bug and advertise it using a new capability bit in the Management Capabilities Mask (MCAM) register. Avoid initialization failures by reading the new capability bit and triggering the new reset flow only if the bit is set. If the bit is not set, trigger a normal PCI hot reset by skipping the call to the Management Reset and Shutdown Register (MRSR). Normal PCI hot reset is weaker than "all reset", but it results in a fully operational driver and allows users to flash a new firmware, if they want to. [1] mlxsw_spectrum4 0000:01:00.0: not ready 1023ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 2047ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 4095ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 8191ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 16383ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 32767ms after bus reset; waiting mlxsw_spectrum4 0000:01:00.0: not ready 65535ms after bus reset; giving up mlxsw_spectrum4 0000:01:00.0: PCI function reset failed with -25 mlxsw_spectrum4 0000:01:00.0: cannot register bus device mlxsw_spectrum4: probe of 0000:01:00.0 failed with error -25 Fixes: f257c73e5356 ("mlxsw: pci: Add support for new reset flow") Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Maksym Yaremchuk <maksymy@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-19mlxsw: pci: Use napi_consume_skb() to free SKB as part of Tx completionAmit Cohen
Currently, as part of Tx completion, the driver calls dev_kfree_skb_any() to free the SKB. For this flow, the correct function is napi_consume_skb(). This function and dev_consume_skb_any() were added to be used for consumed SKBs, which were not dropped, so the skb:kfree_skb tracepoint is not triggered, and we can get better diagnostics about dropped packets. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/a9f9f3dc884c0d1be4bd4c9d72030c88c7ac004f.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
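The substitution, sketched; napi_consume_skb() takes the NAPI budget so it can fall back to dev_consume_skb_any() when invoked outside NAPI context (budget == 0):

    /* Before: fires the skb:kfree_skb tracepoint as if the packet was dropped */
    dev_kfree_skb_any(skb);

    /* After: the SKB was transmitted, i.e. consumed, not dropped */
    napi_consume_skb(skb, budget);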
2024-06-19mlxsw: pci: Do not store SKB for RDQ elementsAmit Cohen
The previous patch used page pool to allocate buffers for the RDQ. With this change, 'elem_info->u.rdq.skb' is not used anymore: we do not allocate an SKB before getting the packet; we hold a page pointer and build the SKB around it once a packet is received. Remove the union and store the SKB pointer for the SDQ only. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/23a531008936dc9a1a298643fb1e4f9a7b8e6eb3.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19mlxsw: pci: Optimize data buffer accessAmit Cohen
Before accessing the data buffer, call net_prefetch() to load it into the cache. This change improves driver performance; the CPU can handle about 7.1% more packets per second. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1fa07c510890866a6f201163ab7e78890ba28b3b.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19mlxsw: pci: Use page pool for Rx buffers allocationAmit Cohen
As part of driver init, all Rx queues are filled with buffers for hardware usage. Later, when a packet is received, a new buffer should be allocated to be used by hardware instead of the received buffer. A packet's processing time includes the allocation time, which can be improved using page pool. With page pool, DMA mapping is done only for the first allocation of a buffer. As subsequent buffer allocations avoid DMA mapping, this results in a performance improvement. The purpose of page pool is to allocate pages fast from a cache, without locking. This lockless guarantee naturally comes from running under a NAPI. Use page pool to allocate the data buffer only, so hardware will use it to fill the packet. At completion time, attach the data buffer (now filled with packet payload) to a new SKB which is allocated around the received buffer. Building the SKB at completion time prevents a cache miss for each packet, as now the SKB is allocated right before the packet is handled by the networking stack. A page pool per Rx queue enhances Rx-side performance by reclaiming buffers back to each queue-specific pool. This change significantly improves driver performance; the CPU can handle about 345% of the packets per second it previously handled. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1cf788a8f43c70aae6d526018ef77becb27ad6d3.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
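A hedged sketch of the flow described (queue field names assumed): pages come from the pool already DMA-mapped, and the SKB is built only once the packet has actually arrived:

    /* Refill: hand a pool page to the hardware. */
    page = page_pool_dev_alloc_pages(q->u.cq.page_pool);

    /* Completion: wrap the filled buffer in an SKB without copying. */
    skb = napi_build_skb(page_address(page), PAGE_SIZE);
    skb_mark_for_recycle(skb);      /* page returns to the pool on free */
    skb_reserve(skb, headroom);
    skb_put(skb, byte_count);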
2024-06-19mlxsw: pci: Initialize page pool per CQAmit Cohen
The next patch will use page pool to allocate buffers for the RDQ. Initialize a page pool for each CQ, which is mapped 1:1 to an RDQ. A page pool per Rx queue enhances Rx-side performance by reclaiming buffers back to each queue-specific pool. When only one NAPI instance is the consumer of pages from a page pool, it is recommended to pass it as part of 'page_pool_params'; page pool API calls can then be done without special locks. The mlxsw driver holds a NAPI instance per CQ, so add a page pool per CQ and use the existing NAPI instance. For now, pages are not allocated from the pool; the next patch will use it. Some notes regarding 'page_pool_params': * Use PP_FLAG_DMA_MAP to let page pool handle DMA mapping. For now do not use the sync flag, as only the device writes to this memory and we read it only when it finishes writing there. This will probably be changed when we support XDP. * Define 'order' according to the maximum MTU, taking software overhead into account. Some rounding up is done, which means that we allocate more pages than we really need. This can be improved later by using fragmented buffers. * Use pool_size = MLXSW_PCI_WQE_COUNT. This will be the size of the 'ptr_ring', and should be the maximum number of packets that page pool will allocate memory for. In our case, this is the queue size, defined as MLXSW_PCI_WQE_COUNT. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/02e5856ae7c572d4293ce6bb92c286ee6cfec800.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
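A sketch of the per-CQ pool creation following those notes (queue field paths assumed):

    struct page_pool_params pp_params = {
            .flags     = PP_FLAG_DMA_MAP,   /* pool handles DMA mapping */
            .order     = get_order(max_rx_len),
            .pool_size = MLXSW_PCI_WQE_COUNT,
            .nid       = dev_to_node(&pdev->dev),
            .dev       = &pdev->dev,
            .napi      = &q->u.cq.napi,     /* single consumer: lockless */
            .dma_dir   = DMA_FROM_DEVICE,
    };
    struct page_pool *pool = page_pool_create(&pp_params);

    if (IS_ERR(pool))
            return PTR_ERR(pool);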
2024-06-19mlxsw: pci: Store CQ pointer as part of RDQ structureAmit Cohen
The next patches will add support for page pool in the mlxsw driver. Page pool will be used to allocate buffers for the RDQ and will use the NAPI instance of the appropriate CQ (an RDQ is mapped 1:1 to a CQ). To allow pool initialization as part of CQ init, when NAPI is initialized, the page_pool structure will be part of the CQ structure. Later, the allocations for the RDQ will be done from the pool in the appropriate CQ. To allow access to the appropriate pool, set a CQ pointer as part of RDQ initialization. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/d60918ca1e142a554af1df9c1152cdac83854a3b.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-19mlxsw: pci: Split NAPI setup/teardown into two stepsAmit Cohen
mlxsw_pci_cq_napi_setup() includes both NAPI initialization and enablement, similar to the teardown function. The next patches will add support for page pool in the mlxsw driver, using the NAPI instance for the page pool. Page pool initialization should be done before NAPI enablement; likewise, page pool destruction should be done after NAPI disablement. As preparation, split NAPI setup/teardown into two steps; page pool setup will then be done between the phases. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/8dbf37e859f07247498fca17109b8858ff2b0498.1718709196.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
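The resulting ordering, sketched with assumed helper names; the pool is created strictly between NAPI init and NAPI enable, and destroyed between disable and delete:

    /* setup */
    mlxsw_pci_cq_napi_init(q);              /* netif_napi_add() */
    err = mlxsw_pci_cq_page_pool_init(q);
    if (err)
            return err;
    napi_enable(&q->u.cq.napi);

    /* teardown, in reverse */
    napi_disable(&q->u.cq.napi);
    mlxsw_pci_cq_page_pool_fini(q);
    netif_napi_del(&q->u.cq.napi);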
2024-06-14mlxsw: Use the same maximum MTU value throughout the driverAmit Cohen
Currently, the driver uses two different values for the maximum MTU: one is stored in mlxsw_port->dev->max_mtu and the second in mlxsw_port->max_mtu. The second one is set to a value which is queried from firmware. This value was never tested and unfortunately is not really supported, which means that with the existing code a user can set an MTU that is not really supported by firmware and that is bigger than the buffer size allocated in pci. To make the driver consistent, use only mlxsw_port->dev->max_mtu as the maximum MTU value; for buffer headroom, add the Ethernet frame headers, which are not included in mlxsw_port->dev->max_mtu. Remove mlxsw_port->max_mtu. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/89fa6f804386b918d337e736e14ac291bb947483.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-14mlxsw: spectrum: Set more accurate values for netdevice min/max MTUAmit Cohen
Currently, the driver uses ETH_MAX_MTU as the maximum MTU of netdevices; instead, use the accurate value which is supported by the driver. Subtract the Ethernet headers which are taken into account by hardware for MTU checking, as described in the previous patch. Set the minimum MTU to ETH_MIN_MTU, as a zero MTU is not really supported. With this change: a. The stack will do the MTU checking, so we can remove it from the driver. b. User space will be able to query the actual MTU limits. Before this patch: $ ip -j -d link show dev swp1 | jq | grep mtu "mtu": 1500, "min_mtu": 0, "max_mtu": 65535, With this patch: $ ip -j -d link show dev swp1 | jq | grep mtu "mtu": 1500, "min_mtu": 68, "max_mtu": 10218, Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/be8232e38c196ecb607f82c5e000ea427ce22abb.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
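The assignments behind the numbers above, sketched (overhead macro name assumed; 10240 - 22 = 10218):

    dev->min_mtu = ETH_MIN_MTU;                                   /* 68 */
    dev->max_mtu = MLXSW_PORT_MAX_MTU - MLXSW_PORT_ETH_FRAME_HDR; /* 10218 */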
2024-06-14mlxsw: Adjust MTU value to hardware checkAmit Cohen
An Ethernet frame consists of an Ethernet header, payload and FCS. The MTU value used by the user is the size of the payload, which means that when the user sets the MTU to X, the total frame size will be larger due to the addition of the Ethernet header and FCS. Spectrum ASICs take the Ethernet header and FCS into account as part of the packet size for the MTU check. Adjust the MTU value when the user sets the MTU, to configure the MTU size which is required by hardware. The Tx header length which was used by the driver is not relevant for such a calculation; take into account the Ethernet header (with VLAN extension) and FCS. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/f3203c2477bb8ed18b1e79642fa3e3713e1e55bb.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
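The overhead being added, sketched (macro name assumed): Ethernet header + one VLAN tag + FCS = 14 + 4 + 4 = 22 bytes on top of the user-visible MTU before programming the hardware:

    #define MLXSW_PORT_ETH_FRAME_HDR (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN) /* 22 */

    err = mlxsw_sp_port_mtu_set(mlxsw_sp_port,
                                mtu + MLXSW_PORT_ETH_FRAME_HDR);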
2024-06-14mlxsw: port: Edit maximum MTU valueAmit Cohen
Currently the mlxsw driver supports up to 10000 bytes as the maximum MTU. This value is not accurate; we can support up to 10K (10240) bytes. Change the value to the maximum MTU supported by firmware. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/666f51681234aeef09d771833ccb6e94bd323c88.1718275854.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-12mlxsw: spectrum_router: Apply user-defined multipath hash seedPetr Machata
When Spectrum machines compute hash for the purposes of ECMP routing, they use a seed specified through RECR_v2 (Router ECMP Configuration Register). Up until now mlxsw computed the seed by hashing the machine's base MAC. Now that we can optionally have a user-provided seed, use that if possible. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240607151357.421181-4-petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
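A sketch of the seed selection (flag and helper names assumed); the MAC-derived hash remains the fallback:

    u32 seed;

    seed = user_seed_valid ? user_seed :
           jhash(mlxsw_sp->base_mac, sizeof(mlxsw_sp->base_mac), 0);
    mlxsw_reg_recr2_pack(recr2_pl, seed);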
2024-06-10mlxsw: spectrum_acl: Fix ACL scale regression and firmware errorsIdo Schimmel
ACLs that reside in the algorithmic TCAM (A-TCAM) in Spectrum-2 and newer ASICs can share the same mask if their masks only differ in up to 8 consecutive bits. For example, consider the following filters: # tc filter add dev swp1 ingress pref 1 proto ip flower dst_ip 192.0.2.0/24 action drop # tc filter add dev swp1 ingress pref 1 proto ip flower dst_ip 198.51.100.128/25 action drop The second filter can use the same mask as the first (dst_ip/24) with a delta of 1 bit. However, the above only works because the two filters have different values in the common unmasked part (dst_ip/24). When entries have the same value in the common unmasked part they create undesired collisions in the device since many entries now have the same key. This leads to firmware errors such as [1] and to a reduced scale. Fix by adjusting the hash table key to only include the value in the common unmasked part. That is, without including the delta bits. That way the driver will detect the collision during filter insertion and spill the filter into the circuit TCAM (C-TCAM). Add a test case that fails without the fix and adjust existing cases that check C-TCAM spillage according to the above limitation. [1] mlxsw_spectrum2 0000:06:00.0: EMAD reg access failed (tid=3379b18a00003394,reg_id=3027(ptce3),type=write,status=8(resource not available)) Fixes: c22291f7cf45 ("mlxsw: spectrum: acl: Implement delta for ERP") Reported-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-10mlxsw: spectrum_acl_erp: Fix object nesting warningIdo Schimmel
ACLs in Spectrum-2 and newer ASICs can reside in the algorithmic TCAM (A-TCAM) or in the ordinary circuit TCAM (C-TCAM). The former can contain more ACLs (i.e., tc filters), but the number of masks in each region (i.e., tc chain) is limited. In order to mitigate the effects of the above limitation, the device allows filters to share a single mask if their masks only differ in up to 8 consecutive bits. For example, dst_ip/25 can be represented using dst_ip/24 with a delta of 1 bit. The C-TCAM does not have a limit on the number of masks being used (and therefore does not support mask aggregation), but can contain a limited number of filters. The driver uses the "objagg" library to perform the mask aggregation by passing it objects that consist of the filter's mask and whether the filter is to be inserted into the A-TCAM or the C-TCAM since filters in different TCAMs cannot share a mask. The set of created objects is dependent on the insertion order of the filters and is not necessarily optimal. Therefore, the driver will periodically ask the library to compute a more optimal set ("hints") by looking at all the existing objects. When the library asks the driver whether two objects can be aggregated the driver only compares the provided masks and ignores the A-TCAM / C-TCAM indication. This is the right thing to do since the goal is to move as many filters as possible to the A-TCAM. The driver also forbids two identical masks from being aggregated since this can only happen if one was intentionally put in the C-TCAM to avoid a conflict in the A-TCAM. The above can result in the following set of hints: H1: {mask X, A-TCAM} -> H2: {mask Y, A-TCAM} // X is Y + delta H3: {mask Y, C-TCAM} -> H4: {mask Z, A-TCAM} // Y is Z + delta After getting the hints from the library the driver will start migrating filters from one region to another while consulting the computed hints and instructing the device to perform a lookup in both regions during the transition. Assuming a filter with mask X is being migrated into the A-TCAM in the new region, the hints lookup will return H1. Since H2 is the parent of H1, the library will try to find the object associated with it and create it if necessary in which case another hints lookup (recursive) will be performed. This hints lookup for {mask Y, A-TCAM} will either return H2 or H3 since the driver passes the library an object comparison function that ignores the A-TCAM / C-TCAM indication. This can eventually lead to nested objects which are not supported by the library [1]. Fix by removing the object comparison function from both the driver and the library as the driver was the only user. That way the lookup will only return exact matches. I do not have a reliable reproducer that can reproduce the issue in a timely manner, but before the fix the issue would reproduce in several minutes and with the fix it does not reproduce in over an hour. Note that the current usefulness of the hints is limited because they include the C-TCAM indication and represent aggregation that cannot actually happen. This will be addressed in net-next. [1] WARNING: CPU: 0 PID: 153 at lib/objagg.c:170 objagg_obj_parent_assign+0xb5/0xd0 Modules linked in: CPU: 0 PID: 153 Comm: kworker/0:18 Not tainted 6.9.0-rc6-custom-g70fbc2c1c38b #42 Hardware name: Mellanox Technologies Ltd. MSN3700C/VMOD0008, BIOS 5.11 10/10/2018 Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work RIP: 0010:objagg_obj_parent_assign+0xb5/0xd0 [...] 
Call Trace: <TASK> __objagg_obj_get+0x2bb/0x580 objagg_obj_get+0xe/0x80 mlxsw_sp_acl_erp_mask_get+0xb5/0xf0 mlxsw_sp_acl_atcam_entry_add+0xe8/0x3c0 mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0 mlxsw_sp_acl_tcam_vchunk_migrate_one+0x16b/0x270 mlxsw_sp_acl_tcam_vregion_rehash_work+0xbe/0x510 process_one_work+0x151/0x370 Fixes: 9069a3817d82 ("lib: objagg: implement optimization hints assembly and use hints for object creation") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-10mlxsw: spectrum_acl_atcam: Fix wrong commentIdo Schimmel
The key is encoded, not encrypted. Fixes: c22291f7cf45 ("mlxsw: spectrum: acl: Implement delta for ERP") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-06-05mlxsw: spectrum_router: Constify struct devlink_dpipe_table_opsChristophe JAILLET
'struct devlink_dpipe_table_ops' are not modified in this driver. Constifying these structures moves some data to a read-only section, thereby increasing overall security. On x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 15557 712 0 16269 3f8d drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.o After: ===== text data bss dec hex filename 15789 488 0 16277 3f95 drivers/net/ethernet/mellanox/mlxsw/spectrum_dpipe.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-07net: annotate writes on dev->mtu from ndo_change_mtu()Eric Dumazet
Simon reported that ndo_change_mtu() methods were never updated to use WRITE_ONCE(dev->mtu, new_mtu) as hinted in commit 501a90c94510 ("inet: protect against too small mtu values.") We read dev->mtu without holding RTNL in many places, with READ_ONCE() annotations. It is time to take care of ndo_change_mtu() methods to use corresponding WRITE_ONCE() Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Simon Horman <horms@kernel.org> Closes: https://lore.kernel.org/netdev/20240505144608.GB67882@kernel.org/ Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Shannon Nelson <shannon.nelson@amd.com> Link: https://lore.kernel.org/r/20240506102812.3025432-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
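The pattern applied to each ndo_change_mtu() implementation; WRITE_ONCE() pairs with the lockless READ_ONCE(dev->mtu) readers:

    static int mlxsw_sp_port_change_mtu(struct net_device *dev, int mtu)
    {
            /* ... program the new MTU into the device first ... */
            WRITE_ONCE(dev->mtu, mtu);
            return 0;
    }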
2024-04-29ipv6: introduce dst_rt6_info() helperEric Dumazet
Instead of (struct rt6_info *)dst casts, we can use : #define dst_rt6_info(_ptr) \ container_of_const(_ptr, struct rt6_info, dst) Some places needed missing const qualifiers : ip6_confirm_neigh(), ipv6_anycast_destination(), ipv6_unicast_destination(), has_gateway() v2: added missing parts (David Ahern) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
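Usage of the new helper versus the open-coded cast it replaces:

    /* Before */
    struct rt6_info *rt = (struct rt6_info *)skb_dst(skb);

    /* After: container_of-based, also accepts const pointers */
    struct rt6_info *rt = dst_rt6_info(skb_dst(skb));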
2024-04-29mlxsw: pci: Use NAPI for event processingAmit Cohen
Spectrum ASICs only support a single interrupt, which means that all the events are handled by one IRQ (interrupt request) handler. Once an interrupt is received, we schedule a tasklet to handle events from the EQ and then schedule tasklets to handle completions from the CQs. A tasklet runs in softIRQ (software IRQ) context and will be run on the same CPU which scheduled it. That means that today we use only one CPU to handle all the packets (both network packets and EMADs) from hardware. This can be improved using NAPI. The idea is to use a NAPI instance per CQ, which is mapped 1:1 to a DQ (RDQ or SDQ). A NAPI poll method can be run in a kernel thread, so the driver will then be able to handle WQEs on several CPUs. Convert the existing code to use the NAPI APIs. Add a NAPI instance as part of 'struct mlxsw_pci_queue' and initialize it as part of CQ initialization. Set the appropriate poll method and dummy net device according to queue number, similar to the tasklet setup. For CQs which are used for completions of RDQs, use the Rx poll method and 'napi_dev_rx', which is set as 'threaded'. This means that the Rx poll method will run in kernel thread context, so several RDQs will be handled in parallel. For CQs which are used for completions of SDQs, use the Tx poll method and 'napi_dev_tx'; this method will run in softIRQ context, as recommended in the NAPI documentation, since Tx packet processing is a short task. Convert mlxsw_pci_cq_{rx,tx}_tasklet() to poll methods. Handle the 'budget' argument - ignore it in the Tx poll method, as it is recommended not to limit Tx processing. For Rx processing, handle up to 'budget' completions and return 'work_done', the number of completions that were handled. Handle the following cases: 1. After processing 'budget' completions, the driver still has work to do: Return work_done = budget. In that case, the NAPI instance will be polled again (without the need to be rescheduled). Do not re-arm the queue, as NAPI will handle the reschedule, so we do not have to involve hardware in sending an additional interrupt for the completions that remain to be processed. 2. Event processing has been completed: Call napi_complete_done() to mark NAPI processing as completed, which means that the poll method will not be rescheduled. Re-arm the queue, as all completions were handled. In case the poll method handled exactly 'budget' completions, return work_done = budget - 1 to distinguish from the case where the driver still has completions to handle. Otherwise, return the number of completions that were handled. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
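A condensed sketch of the Rx poll method logic (helper names assumed; the exact-'budget'-but-completed corner case described above is omitted for brevity):

    static int mlxsw_pci_napi_poll_cq_rx(struct napi_struct *napi, int budget)
    {
            struct mlxsw_pci_queue *q =
                    container_of(napi, struct mlxsw_pci_queue, u.cq.napi);
            int work_done;

            work_done = mlxsw_pci_cq_rx_process(q, budget);
            if (work_done < budget && napi_complete_done(napi, work_done))
                    mlxsw_pci_cq_arm(q);    /* all handled: re-arm the queue */

            /* work_done == budget: NAPI polls again, no re-arm needed */
            return work_done;
    }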
2024-04-29mlxsw: pci: Reorganize 'mlxsw_pci_queue' structureAmit Cohen
The next patch will set the driver to use NAPI for event processing; the tasklet mechanism will then be used only for the EQ. Reorganize 'mlxsw_pci_queue' to hold EQ and CQ attributes in a union. For now, add a tasklet for both EQ and CQ. This will be changed in the next patch, as 'tasklet_struct' will be replaced with a NAPI instance. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-29mlxsw: pci: Initialize dummy net devices for NAPIAmit Cohen
mlxsw will use NAPI for event processing in a next patch. As preparation, add two dummy net devices and initialize them. A NAPI instance should be attached to a net device. Usually each queue is used by a single net device in network drivers, so the mapping between net device and NAPI instance is intuitive. In our case, Rx queues are not per port, they are per trap-group. Tx queues are mapped to net devices, but we do not have a separate queue for each local port; several ports share the same queue. Use init_dummy_netdev() to initialize dummy net devices for NAPI. To run a NAPI poll method in a kernel thread, the net device which the NAPI instance is attached to should be marked as 'threaded'. It is recommended to handle Tx packets in softIRQ context, as usually this is a short task - just freeing the Tx packet which has been transmitted. Rx packet handling is a more complicated task, so drivers can use a dedicated kernel thread to process them. This allows processing packets from different Rx queues in parallel. We would like to handle only Rx packets in kernel threads, which means that we will use two dummy net devices (one for Rx and one for Tx). Set only one of them as 'threaded', as it will be used for Rx processing. Do not fail in case that setting 'threaded' fails, as it is better to use regular softIRQ NAPI than to prevent the driver from loading. Note that the net devices are initialized with init_dummy_netdev(), so they are not registered, which means that they will not be visible to the user. It will not be possible to change the 'threaded' configuration from user space, but this is reasonable in our case, as there is no other configuration which makes sense, considering that the user has no influence on the usage of each queue. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
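A sketch of that setup (device names assumed); only the Rx dummy device is marked threaded, and failure to do so is deliberately non-fatal:

    init_dummy_netdev(&mlxsw_pci->napi_dev_rx);
    strscpy(mlxsw_pci->napi_dev_rx.name, "mlxsw_rx",
            sizeof(mlxsw_pci->napi_dev_rx.name));
    if (dev_set_threaded(&mlxsw_pci->napi_dev_rx, true))
            dev_warn(&pdev->dev, "threaded NAPI unavailable, using softIRQ\n");

    init_dummy_netdev(&mlxsw_pci->napi_dev_tx);
    strscpy(mlxsw_pci->napi_dev_tx.name, "mlxsw_tx",
            sizeof(mlxsw_pci->napi_dev_tx.name));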
2024-04-29mlxsw: pci: Ring RDQ and CQ doorbells once per several completionsAmit Cohen
Currently, for each CQE in a CQ, we ring the CQ doorbell, then handle the RDQ and ring the RDQ doorbell. Finally we ring the CQ arm doorbell - once per CQ tasklet. The idea of ringing the CQ doorbell before the RDQ doorbell is to be sure that when we post a new WQE (after the RDQ is handled), there is an available CQE. This was done because of a hardware bug, as part of commit c9ebea04cb1b ("mlxsw: pci: Ring CQ's doorbell before RDQ's"). There is no real reason to ring the RDQ and CQ doorbells for each completion; it is better to handle several completions and reduce the number of ringings, as access to hardware is expensive (time wise) and might take time because of memory barriers. A previous patch changed the CQ tasklet to handle up to 64 Rx packets. With this limitation, we can ring the CQ and RDQ doorbells once per CQ tasklet. The counters of the doorbells are increased by the number of packets that we handled; the device will then know for which completion to send an additional event. To avoid reordering of the CQ and RDQ doorbell rings, let the tasklet ring the RDQ doorbell as well; mlxsw_pci_cqe_rdq_handle() handles the counter but does not ring the doorbell. Note that with this change there is no need to copy the CQE, as we ring the CQ doorbell only after Rx packet processing (which uses the CQE) is done. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-29mlxsw: pci: Handle up to 64 Rx completions in taskletAmit Cohen
We can get many completions in one interrupt. Currently, the CQ tasklet handles up to half the queue size of completions, and then arms the hardware to generate additional events, which means that in case there were additional completions that we did not handle, we will immediately get an additional interrupt to handle the rest. The decision to handle up to half of the queue size is arbitrary and was made in 2015, when the mlxsw driver was added to the kernel. One additional fact that should be taken into account is that while WQEs from the RDQ are handled, the CPU that handles the tasklet is dedicated to this task, which means that we might hold the CPU for a long time. Handle WQEs in smaller chunks, then arm the CQ doorbell to notify the hardware to send additional notifications. Set the chunk size to 64, as this is the number recommended for NAPI and the driver will use NAPI in a next patch. Note that for now we use the arm doorbell to retrigger the CQ tasklet, but with NAPI it will be more efficient, as software will reschedule the poll method and we will not involve hardware for that. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-04-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/ti/icssg/icssg_prueth.c net/mac80211/chan.c 89884459a0b9 ("wifi: mac80211: fix idle calculation with multi-link") 87f5500285fb ("wifi: mac80211: simplify ieee80211_assign_link_chanctx()") https://lore.kernel.org/all/20240422105623.7b1fbda2@canb.auug.org.au/ net/unix/garbage.c 1971d13ffa84 ("af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc().") 4090fa373f0e ("af_unix: Replace garbage collection algorithm.") drivers/net/ethernet/ti/icssg/icssg_prueth.c drivers/net/ethernet/ti/icssg/icssg_common.c 4dcd0e83ea1d ("net: ti: icssg-prueth: Fix signedness bug in prueth_init_rx_chns()") e2dc7bfd677f ("net: ti: icssg-prueth: Move common functions into a separate file") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-24mlxsw: spectrum_acl_tcam: Fix memory leak when canceling rehash workIdo Schimmel
The rehash delayed work is rescheduled with a delay if the number of credits at end of the work is not negative as supposedly it means that the migration ended. Otherwise, it is rescheduled immediately. After "mlxsw: spectrum_acl_tcam: Fix possible use-after-free during rehash" the above is no longer accurate as a non-negative number of credits is no longer indicative of the migration being done. It can also happen if the work encountered an error in which case the migration will resume the next time the work is scheduled. The significance of the above is that it is possible for the work to be pending and associated with hints that were allocated when the migration started. This leads to the hints being leaked [1] when the work is canceled while pending as part of ACL region dismantle. Fix by freeing the hints if hints are associated with a work that was canceled while pending. Blame the original commit since the reliance on not having a pending work associated with hints is fragile. [1] unreferenced object 0xffff88810e7c3000 (size 256): comm "kworker/0:16", pid 176, jiffies 4295460353 hex dump (first 32 bytes): 00 30 95 11 81 88 ff ff 61 00 00 00 00 00 00 80 .0......a....... 00 00 61 00 40 00 00 00 00 00 00 00 04 00 00 00 ..a.@........... backtrace (crc 2544ddb9): [<00000000cf8cfab3>] kmalloc_trace+0x23f/0x2a0 [<000000004d9a1ad9>] objagg_hints_get+0x42/0x390 [<000000000b143cf3>] mlxsw_sp_acl_erp_rehash_hints_get+0xca/0x400 [<0000000059bdb60a>] mlxsw_sp_acl_tcam_vregion_rehash_work+0x868/0x1160 [<00000000e81fd734>] process_one_work+0x59c/0xf20 [<00000000ceee9e81>] worker_thread+0x799/0x12c0 [<00000000bda6fe39>] kthread+0x246/0x300 [<0000000070056d23>] ret_from_fork+0x34/0x70 [<00000000dea2b93e>] ret_from_fork_asm+0x1a/0x30 Fixes: c9c9af91f1d9 ("mlxsw: spectrum_acl: Allow to interrupt/continue rehash work") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/0cc12ebb07c4d4c41a1265ee2c28b392ff997a86.1713797103.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
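The shape of the fix, using the names from the commit text (assumed): a pending work that is canceled surrenders the hints it holds; cancel_delayed_work_sync() returns true only if the work was still pending:

    if (cancel_delayed_work_sync(&vregion->rehash.dw) &&
        vregion->rehash.ctx.hints_priv)
            objagg_hints_put(vregion->rehash.ctx.hints_priv);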
2024-04-24mlxsw: spectrum_acl_tcam: Fix incorrect list API usageIdo Schimmel
Both the function that migrates all the chunks within a region and the function that migrates all the entries within a chunk call list_first_entry() on the respective lists without checking that the lists are not empty. This is incorrect usage of the API, which leads to the following warning [1]. Fix by returning if the lists are empty as there is nothing to migrate in this case. [1] WARNING: CPU: 0 PID: 6437 at drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c:1266 mlxsw_sp_acl_tcam_vchunk_migrate_all+0x1f1/0> Modules linked in: CPU: 0 PID: 6437 Comm: kworker/0:37 Not tainted 6.9.0-rc3-custom-00883-g94a65f079ef6 #39 Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019 Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work RIP: 0010:mlxsw_sp_acl_tcam_vchunk_migrate_all+0x1f1/0x2c0 [...] Call Trace: <TASK> mlxsw_sp_acl_tcam_vregion_rehash_work+0x6c/0x4a0 process_one_work+0x151/0x370 worker_thread+0x2cb/0x3e0 kthread+0xd0/0x100 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> Fixes: 6f9579d4e302 ("mlxsw: spectrum_acl: Remember where to continue rehash migration") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/4628e9a22d1d84818e28310abbbc498e7bc31bc9.1713797103.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
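The corrected usage pattern (struct names assumed): list_first_entry() is only defined for non-empty lists, so bail out early when there is nothing to migrate:

    if (list_empty(&vchunk->ventry_list))
            return 0;       /* nothing to migrate */
    ventry = list_first_entry(&vchunk->ventry_list,
                              struct mlxsw_sp_acl_tcam_ventry, list);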
2024-04-24mlxsw: spectrum_acl_tcam: Fix warning during rehashIdo Schimmel
As previously explained, the rehash delayed work migrates filters from one region to another. This is done by iterating over all chunks (all the filters with the same priority) in the region and in each chunk iterating over all the filters. When the work runs out of credits it stores the current chunk and entry as markers in the per-work context so that it would know where to resume the migration from the next time the work is scheduled. Upon error, the chunk marker is reset to NULL, but without resetting the entry markers despite them being relative to it. This can result in migration being resumed from an entry that does not belong to the chunk being migrated. In turn, this will eventually lead to a chunk being iterated over as if it is an entry. Because of how the two structures happen to be defined, this does not lead to KASAN splats, but to warnings such as [1]. Fix by creating a helper that resets all the markers and call it from all the places that currently only reset the chunk marker. For good measure also call it when starting a completely new rehash. Add a warning to avoid future cases. [1] WARNING: CPU: 7 PID: 1076 at drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c:407 mlxsw_afk_encode+0x242/0x2f0 Modules linked in: CPU: 7 PID: 1076 Comm: kworker/7:24 Tainted: G W 6.9.0-rc3-custom-00880-g29e61d91b77b #29 Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019 Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work RIP: 0010:mlxsw_afk_encode+0x242/0x2f0 [...] Call Trace: <TASK> mlxsw_sp_acl_atcam_entry_add+0xd9/0x3c0 mlxsw_sp_acl_tcam_entry_create+0x5e/0xa0 mlxsw_sp_acl_tcam_vchunk_migrate_all+0x109/0x290 mlxsw_sp_acl_tcam_vregion_rehash_work+0x6c/0x470 process_one_work+0x151/0x370 worker_thread+0x2cb/0x3e0 kthread+0xd0/0x100 ret_from_fork+0x34/0x50 </TASK> Fixes: 6f9579d4e302 ("mlxsw: spectrum_acl: Remember where to continue rehash migration") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/cc17eed86b41dd829d39b07906fec074a9ce580e.1713797103.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
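A sketch of the helper the fix introduces (names assumed), keeping the three markers consistent wherever migration state is reset:

    static void
    mlxsw_sp_acl_tcam_rehash_ctx_vchunk_reset(struct mlxsw_sp_acl_tcam_rehash_ctx *ctx)
    {
            ctx->current_vchunk = NULL;
            ctx->start_ventry = NULL;
            ctx->stop_ventry = NULL;
    }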
2024-04-24mlxsw: spectrum_acl_tcam: Fix memory leak during rehashIdo Schimmel
The rehash delayed work migrates filters from one region to another. This is done by iterating over all chunks (all the filters with the same priority) in the region and in each chunk iterating over all the filters. If the migration fails, the code tries to migrate the filters back to the old region. However, the rollback itself can also fail in which case another migration will be erroneously performed. Besides the fact that this ping pong is not a very good idea, it also creates a problem. Each virtual chunk references two chunks: The currently used one ('vchunk->chunk') and a backup ('vchunk->chunk2'). During migration the first holds the chunk we want to migrate filters to and the second holds the chunk we are migrating filters from. The code currently assumes - but does not verify - that the backup chunk does not exist (NULL) if the currently used chunk does not reference the target region. This assumption breaks when we are trying to rollback a rollback, resulting in the backup chunk being overwritten and leaked [1]. Fix by not rolling back a failed rollback and add a warning to avoid future cases. [1] WARNING: CPU: 5 PID: 1063 at lib/parman.c:291 parman_destroy+0x17/0x20 Modules linked in: CPU: 5 PID: 1063 Comm: kworker/5:11 Tainted: G W 6.9.0-rc2-custom-00784-gc6a05c468a0b #14 Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019 Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work RIP: 0010:parman_destroy+0x17/0x20 [...] Call Trace: <TASK> mlxsw_sp_acl_atcam_region_fini+0x19/0x60 mlxsw_sp_acl_tcam_region_destroy+0x49/0xf0 mlxsw_sp_acl_tcam_vregion_rehash_work+0x1f1/0x470 process_one_work+0x151/0x370 worker_thread+0x2cb/0x3e0 kthread+0xd0/0x100 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> Fixes: 843500518509 ("mlxsw: spectrum_acl: Do rollback as another call to mlxsw_sp_acl_tcam_vchunk_migrate_all()") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Alexander Zubkov <green@qrator.net> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/d5edd4f4503934186ae5cfe268503b16345b4e0f.1713797103.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
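A sketch of the corrected control flow (field names follow the commit text and are assumed, e.g. a rollback flag in the rehash context): a rollback may be attempted once, and a failed rollback is never rolled back again:

    err = mlxsw_sp_acl_tcam_vchunk_migrate_all(mlxsw_sp, vregion, ctx, credits);
    if (err) {
            if (ctx->this_is_rollback)
                    return;         /* do not ping-pong between regions */
            /* migrate filters back to the old region, once */
            swap(vregion->region, vregion->region2);
            ctx->this_is_rollback = true;
            mlxsw_sp_acl_tcam_vchunk_migrate_all(mlxsw_sp, vregion, ctx,
                                                 credits);
    }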
2024-04-24mlxsw: spectrum_acl_tcam: Rate limit error messageIdo Schimmel
In the rare cases when device resources are exhausted, the rehash
delayed work is likely to fail. An error message will be printed
whenever this happens, which can be overwhelming considering that the
work is per-region and that there can be hundreds of regions.

Fix by rate limiting the error message.

Fixes: e5e7962ee5c2 ("mlxsw: spectrum_acl: Implement region migration according to hints")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/c510763b2ebd25e7990d80183feff91cde593145.1713797103.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
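This follows a standard kernel pattern: switch the printk-family call
to its rate-limited variant. dev_err_ratelimited() is the real helper
from <linux/dev_printk.h>; the expression and message text below are
illustrative rather than the exact upstream diff:

    -       dev_err(mlxsw_sp->bus_info->dev, "Failed to migrate vregion\n");
    +       dev_err_ratelimited(mlxsw_sp->bus_info->dev, "Failed to migrate vregion\n");

The rate-limited variant drops messages beyond a burst within a time
interval, so hundreds of simultaneously failing regions produce a
handful of log lines instead of hundreds.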
2024-04-24mlxsw: spectrum_acl_tcam: Fix possible use-after-free during rehashIdo Schimmel
The rehash delayed work migrates filters from one region to another
according to the number of available credits.

The migrated-from region is destroyed at the end of the work if the
number of credits is non-negative, as the assumption is that this is
indicative of migration being complete. This assumption is incorrect,
as a non-negative number of credits can also be the result of a failed
migration.

The destruction of a region that still has filters referencing it can
result in a use-after-free [1].

Fix by not destroying the region if migration failed.

[1]
BUG: KASAN: slab-use-after-free in mlxsw_sp_acl_ctcam_region_entry_remove+0x21d/0x230
Read of size 8 at addr ffff8881735319e8 by task kworker/0:31/3858

CPU: 0 PID: 3858 Comm: kworker/0:31 Tainted: G W 6.9.0-rc2-custom-00782-gf2275c2157d8 #5
Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
Workqueue: mlxsw_core mlxsw_sp_acl_tcam_vregion_rehash_work
Call Trace:
 <TASK>
 dump_stack_lvl+0xc6/0x120
 print_report+0xce/0x670
 kasan_report+0xd7/0x110
 mlxsw_sp_acl_ctcam_region_entry_remove+0x21d/0x230
 mlxsw_sp_acl_ctcam_entry_del+0x2e/0x70
 mlxsw_sp_acl_atcam_entry_del+0x81/0x210
 mlxsw_sp_acl_tcam_vchunk_migrate_all+0x3cd/0xb50
 mlxsw_sp_acl_tcam_vregion_rehash_work+0x157/0x1300
 process_one_work+0x8eb/0x19b0
 worker_thread+0x6c9/0xf70
 kthread+0x2c9/0x3b0
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Allocated by task 174:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 __kasan_kmalloc+0x8f/0xa0
 __kmalloc+0x19c/0x360
 mlxsw_sp_acl_tcam_region_create+0xdf/0x9c0
 mlxsw_sp_acl_tcam_vregion_rehash_work+0x954/0x1300
 process_one_work+0x8eb/0x19b0
 worker_thread+0x6c9/0xf70
 kthread+0x2c9/0x3b0
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30

Freed by task 7:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x60
 poison_slab_object+0x102/0x170
 __kasan_slab_free+0x14/0x30
 kfree+0xc1/0x290
 mlxsw_sp_acl_tcam_region_destroy+0x272/0x310
 mlxsw_sp_acl_tcam_vregion_rehash_work+0x731/0x1300
 process_one_work+0x8eb/0x19b0
 worker_thread+0x6c9/0xf70
 kthread+0x2c9/0x3b0
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30

Fixes: c9c9af91f1d9 ("mlxsw: spectrum_acl: Allow to interrupt/continue rehash work")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/3e412b5659ec2310c5c615760dfe5eac18dd7ebd.1713797103.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
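A hedged sketch of the corrected end-of-work logic, with hypothetical
names; the point is that success is checked explicitly instead of
being inferred from a non-negative credit count:

    /* Hedged illustration, not the upstream code. */
    err = vregion_migrate(vregion, ctx, &credits);
    if (!err && credits >= 0) {
            /* Migration truly completed: no filter references the
             * old region anymore, so destroying it is safe.
             */
            region_destroy(vregion->region2);
            vregion->region2 = NULL;
    }
    /* On error, or when simply out of credits, the old region is
     * kept alive; the work is rescheduled to finish (or roll back)
     * the migration later.
     */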
2024-04-24mlxsw: spectrum_acl_tcam: Fix possible use-after-free during activity updateIdo Schimmel
The rule activity update delayed work periodically traverses the list
of configured rules and queries their activity from the device.

As part of this task it accesses the entry pointed to by
'ventry->entry', but this entry can be changed concurrently by the
rehash delayed work, leading to a use-after-free [1].

Fix by closing the race and performing the activity query under the
'vregion->lock' mutex.

[1]
BUG: KASAN: slab-use-after-free in mlxsw_sp_acl_tcam_flower_rule_activity_get+0x121/0x140
Read of size 8 at addr ffff8881054ed808 by task kworker/0:18/181

CPU: 0 PID: 181 Comm: kworker/0:18 Not tainted 6.9.0-rc2-custom-00781-gd5ab772d32f7 #2
Hardware name: Mellanox Technologies Ltd. MSN3700/VMOD0005, BIOS 5.11 01/06/2019
Workqueue: mlxsw_core mlxsw_sp_acl_rule_activity_update_work
Call Trace:
 <TASK>
 dump_stack_lvl+0xc6/0x120
 print_report+0xce/0x670
 kasan_report+0xd7/0x110
 mlxsw_sp_acl_tcam_flower_rule_activity_get+0x121/0x140
 mlxsw_sp_acl_rule_activity_update_work+0x219/0x400
 process_one_work+0x8eb/0x19b0
 worker_thread+0x6c9/0xf70
 kthread+0x2c9/0x3b0
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Allocated by task 1039:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 __kasan_kmalloc+0x8f/0xa0
 __kmalloc+0x19c/0x360
 mlxsw_sp_acl_tcam_entry_create+0x7b/0x1f0
 mlxsw_sp_acl_tcam_vchunk_migrate_all+0x30d/0xb50
 mlxsw_sp_acl_tcam_vregion_rehash_work+0x157/0x1300
 process_one_work+0x8eb/0x19b0
 worker_thread+0x6c9/0xf70
 kthread+0x2c9/0x3b0
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30

Freed by task 1039:
 kasan_save_stack+0x33/0x60
 kasan_save_track+0x14/0x30
 kasan_save_free_info+0x3b/0x60
 poison_slab_object+0x102/0x170
 __kasan_slab_free+0x14/0x30
 kfree+0xc1/0x290
 mlxsw_sp_acl_tcam_vchunk_migrate_all+0x3d7/0xb50
 mlxsw_sp_acl_tcam_vregion_rehash_work+0x157/0x1300
 process_one_work+0x8eb/0x19b0
 worker_thread+0x6c9/0xf70
 kthread+0x2c9/0x3b0
 ret_from_fork+0x4d/0x80
 ret_from_fork_asm+0x1a/0x30

Fixes: 2bffc5322fd8 ("mlxsw: spectrum_acl: Don't take mutex in mlxsw_sp_acl_tcam_vregion_rehash_work()")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Alexander Zubkov <green@qrator.net>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/1fcce0a60b231ebeb2515d91022284ba7b4ffe7a.1713797103.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
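A hedged sketch of the fix, with hypothetical names around the real
mutex_lock()/mutex_unlock() kernel API: the activity query takes the
same 'vregion->lock' mutex that the rehash work holds while it
replaces 'ventry->entry', so the query can never dereference an entry
that is concurrently being freed:

    /* Hedged illustration, not the upstream code. */
    #include <linux/mutex.h>

    static int ventry_activity_get(struct vregion *vregion,
                                   struct ventry *ventry, bool *activity)
    {
            int err;

            mutex_lock(&vregion->lock);     /* serialize against rehash */
            err = entry_activity_get(ventry->entry, activity);
            mutex_unlock(&vregion->lock);
            return err;
    }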