summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox/mlx5/core/health.c
AgeCommit message (Collapse)Author
2024-03-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/page_pool_user.c 0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors") 429679dcf7d9 ("page_pool: fix netlink dump stop/resume") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-01net/mlx5: Fix fw reporter diagnose outputAya Levin
Restore fw reporter diagnose to print the syndrome even if it is zero. Following the cited commit, in this case (syndrome == 0) command returns no output at all. This fix restores command output in case syndrome is cleared: $ devlink health diagnose pci/0000:82:00.0 reporter fw Syndrome: 0 Fixes: d17f98bf7cc9 ("net/mlx5: devlink health: use retained error fmsg API") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-05net/mlx5: Remove initial segmentation duplicate definitionsGal Pressman
Device definitions belong in mlx5_ifc, remove the duplicates in mlx5_core.h. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-05net/mlx5: remove fw reporter dump option for non PFMoshe Shemesh
In case function is not a Physical Function it is not allowed to get FW core dump, so if tried it will fail the fw health reporter dump option. Instead of failing, remove the option of fw_fatal health reporter dump for such function. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2024-02-05net/mlx5: remove fw_fatal reporter dump option for non PFMoshe Shemesh
In case function is not a Physical Function it is not allowed to collect crdump, so if tried it will fail the fw_fatal health reporter dump option. Instead of failing on permission, remove the option of fw_fatal health reporter dump for such function. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-10-20net/mlx5: devlink health: use retained error fmsg APIPrzemek Kitszel
Drop unneeded error checking. devlink_fmsg_*() family of functions is now retaining errors, so there is no need to check for them after each call. Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-19net/mlx5: Add a health error syndrome for pci data poisonedMoshe Shemesh
Add new health error syndrome to indicate that pci data poisoned error has been received while fetching device ICM data. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-08-21net/mlx5: Remove health syndrome enum duplicationGal Pressman
Health syndrome enum values were duplicated in mlx5_ifc and health.c, the correct place for them is mlx5_ifc. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-16net/mlx5: Add header file for eventsJuhee Kang
Separate the event API defined in the generic mlx5.h header into a dedicated header. And remove the TODO comment in commit 69c1280b1f3b ("net/mlx5: Device events, Use async events chain"). Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-06-09net/mlx5: Light probe local SFsShay Drory
In case user wants to configure the SFs, for example: to use only vdpa functionality, he needs to fully probe a SF, configure what he wants, and afterward reload the SF. In order to save the time of the reload, local SFs will probe without any auxiliary sub-device, so that the SFs can be configured prior to its full probe. The defaults of the enable_* devlink params of these SFs are set to false. Usage example: Create SF: $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11 $ devlink port function set pci/0000:08:00.0/32768 \ hw_addr 00:00:00:00:00:11 state active Enable ETH auxiliary device: $ devlink dev param set auxiliary/mlx5_core.sf.1 \ name enable_eth value true cmode driverinit Now, in order to fully probe the SF, use devlink reload: $ devlink dev reload auxiliary/mlx5_core.sf.1 At this point the user have SF devlink instance with auxiliary device for the Ethernet functionality only. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-04-20net/mlx5: Add vnic devlink health reporter to PFs/VFsMaher Sanalla
Create a vnic devlink health reporter for PFs/VFs interfaces. The reporter's diagnose callback displays the values of vNIC/vport transport debug counters of PFs/VFs, as follows: $ devlink health diagnose pci/0000:08:00.0 reporter vnic vNIC env counters: total_error_queues: 0 send_queue_priority_update_flow: 0 comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0 invalid_command: 0 quota_exceeded_command: 0 nic_receive_steering_discard: 0 Moreover, add documentation on the reporter functionality and the counters description. While at it, expose the vNIC counters diagnose function to be used by the downstream patch, which will reveal the counters for representor interfaces. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-03-15net/mlx5: Stop waiting for PCI up if teardown was triggeredMoshe Shemesh
If driver teardown is called while PCI is turned off, there is a race between health recovery and teardown. If health recovery already started it will wait 60 sec trying to see if PCI gets back and it can recover, but actually there is no need to wait anymore once teardown was called. Use the MLX5_BREAK_FW_WAIT flag which is set on driver teardown to break waiting for PCI up. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230314054234.267365-3-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-14net/mlx5: Suspend auxiliary devices only in case of PCI device suspendJiri Pirko
The original behavior introduced by commit c6acd629eec7 ("net/mlx5e: Add support for devlink-port in non-representors mode") correctly re-instantiated uplink devlink port and related netdevice during devlink reload. However with migration to auxiliary devices, this behaviour changed. Restore the original behaviour and tear down auxiliary devices completely during devlink reload. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-02-07net/mlx5: Remove redundant health work lockShay Drory
Commit 90e7cb78b815 ("net/mlx5: fix missing mutex_unlock in mlx5_fw_fatal_reporter_err_work()") introduced another checking of MLX5_DROP_HEALTH_NEW_WORK. At this point, the first check of MLX5_DROP_HEALTH_NEW_WORK is redundant and so is the lock that protects it. Remove the lock and rename MLX5_DROP_HEALTH_NEW_WORK to reflect these changes. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2023-01-18net/mlx5: fix missing mutex_unlock in mlx5_fw_fatal_reporter_err_work()Yang Yingliang
Add missing mutex_unlock() before returning from mlx5_fw_fatal_reporter_err_work(). Fixes: 9078e843efec ("net/mlx5: Avoid recovery in probe flows") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-12-28net/mlx5: Avoid recovery in probe flowsShay Drory
Currently, recovery is done without considering whether the device is still in probe flow. This may lead to recovery before device have finished probed successfully. e.g.: while mlx5_init_one() is running. Recovery flow is using functionality that is loaded only by mlx5_init_one(), and there is no point in running recovery without mlx5_init_one() finished successfully. Fix it by waiting for probe flow to finish and checking whether the device is probed before trying to perform recovery. Fixes: 51d138c2610a ("net/mlx5: Fix health error state handling") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-10-03net/mlx5: Set default grace period based on function typeMaher Sanalla
Currently, driver sets the same grace period for fw fatal health reporter to any type of function. Since the lower level functions are more vulnerable to fw fatal errors as a result of parent function closure/reload, set a smaller grace period for the lower level functions, as follows: 1. For ECPF: 180 seconds. 2. For PF: 60 seconds. 3. For VF/SF: 30 seconds. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03net/mlx5: Start health poll at earlier stage of driver loadMoshe Shemesh
Start health poll at earlier stage, so if fw fatal issue occurred before or during initialization commands such as init_hca or set_hca_cap the poll health can detect and indicate that the driver is already in error state. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-30net/mlx5: Fix spelling mistake "syndrom" -> "syndrome"Colin Ian King
There is a spelling mistake in a devlink_health_report message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-27net/mlx5: Remove unused functionsGal Pressman
Remove functions which are no longer used in the driver: mlx5e_ipsec_is_tx_flow mlx5_health_flush get_cqe_enhanced_num_mini_cqes get_cqe_l3_hdr_type mlx5_health_flush mlx5_fs_is_ipsec_flow _mlx5_fs_is_outer_ipproto_flow mlx5_fs_is_outer_tcp_flow mlx5_fs_is_outer_udp_flow mlx5_fs_is_vxlan_flow mlx5_fs_is_outer_ipsec_flow Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-07-28devlink: Hold the instance lock in health callbacksMoshe Shemesh
Let the core take the devlink instance lock around health callbacks and remove the now redundant locking in the drivers. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28net/mlx5: Lock mlx5 devlink health recovery callbackMoshe Shemesh
Change devlink instance locks in mlx5 driver to have devlink health recovery callback locked, while keeping all driver paths which lead to devl_ API functions called by the driver locked. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-10net/mlx5: Delete useless module.h includeLeon Romanovsky
There is no need in include of module.h in the following files. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-30net/mlx5: Fix access to a non-supported registerAya Levin
Validate MRTC register is supported before triggering a delayed work which accesses it. Fixes: 5a1023deeed0 ("net/mlx5: Add periodic update of host time to firmware") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-11-30net/mlx5: Fix too early queueing of log timestamp workGal Pressman
The log timestamp work should not be queued before the command interface is initialized, move it to a later stage in the init flow. Fixes: 5a1023deeed0 ("net/mlx5: Add periodic update of host time to firmware") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-11-16net/mlx5: Avoid printing health buffer when firmware is unavailableAya Levin
Use firmware version field as an indication to health buffer's sanity. When firmware version is 0xFFFFFFFF, deduce that firmware is unavailable and avoid printing the health buffer to dmesg as it doesn't provide debug info. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25net/mlx5: Add periodic update of host time to firmwareAya Levin
Firmware logs its asserts also to non-volatile memory. In order to reduce drift between the NIC and the host, the driver sets the host epoch-time to the firmware every hour. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25net/mlx5: Print health buffer by log levelAya Levin
Add log macro which gets log level as a parameter. Use the severity read from the health buffer and the new log macro to log the health buffer with severity as log level. Prior to this patch, health buffer was printed in error log level regardless of its severity. Now the user may filter dmesg (--level) or change kernel log level to focus on different severity levels of firmware errors. Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-25net/mlx5: Extend health buffer dumpAya Levin
Enhance health buffer to include: - assert_var5: expose the 6'th assert variable. - time: error's time-stamp in seconds (epoch time). - rfr: Recovery Flow Requiered. When set, indicates that the error cannot be recovered without flow involving reset. - severity: error's severity value, ranging from emergency to debug. Expose them in the health buffer dump (dmesg and devlink fw reporter). Health buffer in dmesg: mlx5_core 0000:08:00.0: print_health_info:425:(pid 912): Health issue observed, firmware internal error, severity(3) ERROR: mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[0] 0x08040700 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[1] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[2] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[3] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[4] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:429:(pid 912): assert_var[5] 0x00000000 mlx5_core 0000:08:00.0: print_health_info:432:(pid 912): assert_exit_ptr 0x00aaf800 mlx5_core 0000:08:00.0: print_health_info:434:(pid 912): assert_callra 0x00aaf70c mlx5_core 0000:08:00.0: print_health_info:436:(pid 912): fw_ver 16.32.492 mlx5_core 0000:08:00.0: print_health_info:437:(pid 912): time 1634819758 mlx5_core 0000:08:00.0: print_health_info:438:(pid 912): hw_id 0x0000020d mlx5_core 0000:08:00.0: print_health_info:439:(pid 912): rfr 0 mlx5_core 0000:08:00.0: print_health_info:440:(pid 912): severity 3 (ERROR) mlx5_core 0000:08:00.0: print_health_info:441:(pid 912): irisc_index 9 mlx5_core 0000:08:00.0: print_health_info:442:(pid 912): synd 0x1: firmware internal error mlx5_core 0000:08:00.0: print_health_info:444:(pid 912): ext_synd 0x802b mlx5_core 0000:08:00.0: print_health_info:445:(pid 912): raw fw_ver 0x102001ec Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-10-15net/mlx5: Read timeout values from DTORAmir Tzin
Replace hard coded timeouts with values stored by firmware in default timeouts register (DTOR). Timeouts are read during driver load. If DTOR is not supported by firmware then fallback to hard coded defaults instead. Signed-off-by: Amir Tzin <amirtz@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Delete impossible dev->state checksLeon Romanovsky
New mlx5_core device structure is allocated through devlink_alloc with\ kzalloc and that ensures that all fields are equal to zero and it includes ->state too. That means that checks of that field in the mlx5_init_one() is completely redundant, because that function is called only once in the begging of mlx5_core_dev lifetime. PCI: .probe() -> probe_one() -> mlx5_init_one() The recovery flow can't run at that time or before it, because relevant work initialized later in mlx5_init_once(). Such initialization flow ensures that dev->state can't be MLX5_DEVICE_STATE_UNINITIALIZED at all, so remove such impossible checks. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-08-11net/mlx5: Fix typo in commentsCai Huoqing
Fix typo: *vectores ==> vectors *realeased ==> released *erros ==> errors *namepsace ==> namespace *trafic ==> traffic *proccessed ==> processed *retore ==> restore *Currenlty ==> Currently *crated ==> created *chane ==> change *cannnot ==> cannot *usuallly ==> usually *failes ==> fails *importent ==> important *reenabled ==> re-enabled *alocation ==> allocation *recived ==> received *tanslation ==> translation Signed-off-by: Cai Huoqing <caihuoqing@baidu.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-07-27net/mlx5: Unload device upon firmware fatal errorAya Levin
When fw_fatal reporter reports an error, the firmware in not responding. Unload the device to ensure that the driver closes all its resources, even if recovery is not due (user disabled auto-recovery or reporter is in grace period). On successful recovery the device is loaded back up. Fixes: b3bd076f7501 ("net/mlx5: Report devlink health on FW fatal issues") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-25net/mlx5: Fix spelling mistakes in mlx5_core_info messageColin Ian King
There are two spelling mistakes in a mlx5_core_info message. Fix them. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-03-11net/mlx5: Check returned value from health recover sequenceLeon Romanovsky
MLX5_INTERFACE_STATE_UP is far from being reliable check for success to recover, because it can be changed any time and health logic doesn't have any locks to protect from it. The locks are not needed here because health recover is good to have, but not must to success, so rely on the returned value from the mlx5_recover_device() as a marker for success/failure. Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-02-11net/mlx5: Fix health error state handlingShay Drory
Currently, when we discover a fatal error, we are queueing a work that will wait for a lock in order to enter the device to error state. Meanwhile, FW commands are still being processed, and gets timeouts. This can block the driver for few minutes before the work will manage to get the lock and enter to error state. Setting the device to error state before queueing health work, in order to avoid FW commands being processed while the work is waiting for the lock. Fixes: c1d4d2e92ad6 ("net/mlx5: Avoid calling sleeping function by the health poll thread") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-10-09net/mlx5: Handle sync reset request eventMoshe Shemesh
Once the driver gets sync_reset_request from firmware it prepares for the coming reset and sends acknowledge. After getting this event the driver expects device reset, either it will trigger PCI reset on sync_reset_now event or such PCI reset will be triggered by another PF of the same device. So it moves to reset requested mode and if it gets PCI reset triggered by the other PF it detect the reset and reloads. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-06-11net/mlx5: Fix fatal error handling during device loadShay Drory
Currently, in case of fatal error during mlx5_load_one(), we cannot enter error state until mlx5_load_one() is finished, what can take several minutes until commands will get timeouts, because these commands can't be processed due to the fatal error. Fix it by setting dev->state as MLX5_DEVICE_STATE_INTERNAL_ERROR before requesting the lock. Fixes: c1d4d2e92ad6 ("net/mlx5: Avoid calling sleeping function by the health poll thread") Signed-off-by: Shay Drory <shayd@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-04-30Merge branch 'mlx5-next' of ↵Saeed Mahameed
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux mlx5 updates for both net-next and rdma-next: 1) HW bits and definitions for TLS and IPsec offlaods 2) Release all pages capability bits 3) New command interface helpers and some code cleanup as a result 4) Move qp.c out of mlx5 core driver into mlx5_ib rdma driver Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-04-19net/mlx5: Delete not-used cmd headerLeon Romanovsky
The structures defined in the cmd header are not used and can be safely removed from the driver. This patch removes that file and deletes all relevant includes. Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-04-08net/mlx5: Fix frequent ioread PCI access during recoveryMoshe Shemesh
High frequency of PCI ioread calls during recovery flow may cause the following trace on powerpc: [ 248.670288] EEH: 2100000 reads ignored for recovering device at location=Slot1 driver=mlx5_core pci addr=0000:01:00.1 [ 248.670331] EEH: Might be infinite loop in mlx5_core driver [ 248.670361] CPU: 2 PID: 35247 Comm: kworker/u192:11 Kdump: loaded Tainted: G OE ------------ 4.14.0-115.14.1.el7a.ppc64le #1 [ 248.670425] Workqueue: mlx5_health0000:01:00.1 health_recover_work [mlx5_core] [ 248.670471] Call Trace: [ 248.670492] [c00020391c11b960] [c000000000c217ac] dump_stack+0xb0/0xf4 (unreliable) [ 248.670548] [c00020391c11b9a0] [c000000000045818] eeh_check_failure+0x5c8/0x630 [ 248.670631] [c00020391c11ba50] [c00000000068fce4] ioread32be+0x114/0x1c0 [ 248.670692] [c00020391c11bac0] [c00800000dd8b400] mlx5_error_sw_reset+0x160/0x510 [mlx5_core] [ 248.670752] [c00020391c11bb60] [c00800000dd75824] mlx5_disable_device+0x34/0x1d0 [mlx5_core] [ 248.670822] [c00020391c11bbe0] [c00800000dd8affc] health_recover_work+0x11c/0x3c0 [mlx5_core] [ 248.670891] [c00020391c11bc80] [c000000000164fcc] process_one_work+0x1bc/0x5f0 [ 248.670955] [c00020391c11bd20] [c000000000167f8c] worker_thread+0xac/0x6b0 [ 248.671015] [c00020391c11bdc0] [c000000000171618] kthread+0x168/0x1b0 [ 248.671067] [c00020391c11be30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80 Reduce the PCI ioread frequency during recovery by using msleep() instead of cond_resched() Fixes: 3e5b72ac2f29 ("net/mlx5: Issue SW reset on FW assert") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-30devlink: Implicitly set auto recover flag when registering health reporterEran Ben Elisha
When health reporter is registered to devlink, devlink will implicitly set auto recover if and only if the reporter has a recover method. No reason to explicitly get the auto recover flag from the driver. Remove this flag from all drivers that called devlink_health_reporter_create. All existing health reporters set auto recovery to true if they have a recover method. Yet, administrator can unset auto recover via netlink command as prior to this patch. Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12net/mlx5: Dump of fw_fatal use updated devlink binary interfaceAya Levin
Remove redundant code from fw_fatal reporter's dump callback. Use updated devlink interface of binary fmsg pair which breaks the output into chunks internally. Signed-off-by: Aya Levin <ayal@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
The only slightly tricky merge conflict was the netdevsim because the mutex locking fix overlapped a lot of driver reload reorganization. The rest were (relatively) trivial in nature. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-18net/mlx5: fix memory leak in mlx5_fw_fatal_reporter_dumpNavid Emamdoost
In mlx5_fw_fatal_reporter_dump if mlx5_crdump_collect fails the allocated memory for cr_data must be released otherwise there will be memory leak. To fix this, this commit changes the return instruction into goto error handling. Fixes: 9b1f29823605 ("net/mlx5: Add support for FW fatal reporter dump") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-10-11devlink: propagate extack down to health reporter opsJiri Pirko
During health reporter operations, driver might want to fill-up the extack message, so propagate extack down to the health reporter ops. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-22net/mlx5: Fix delay in fw fatal report handling due to fw reportMoshe Shemesh
When fw fatal error occurs, poll health() first detects and reports on a fw error. Afterwards, it detects and reports on the fw fatal error itself. That can cause a long delay in fw fatal error handling which waits in a queue for the fw error handling to be finished. The fw error handle will try asking for fw core dump command while fw in fatal state may not respond and driver will wait for command timeout. Changing the flow to detect and handle first fw fatal errors and only if no fatal error detected look for a fw error to handle. Fixes: d1bf0e2cc4a6 ("net/mlx5: Report devlink health on FW issues") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-22net/mlx5: Fix crdump chunks printMoshe Shemesh
Crdump repeats itself every chunk of 256bytes. That is due to bug of missing progressing offset while copying the data from buffer to devlink_fmsg. Fixes: 9b1f29823605 ("net/mlx5: Add support for FW fatal reporter dump") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-07-18net/mlx5: Replace kfree with kvfreeChuhong Yuan
Variable allocated by kvmalloc should not be freed by kfree. Because it may be allocated by vmalloc. So replace kfree with kvfree here. Fixes: 9b1f298236057 ("net/mlx5: Add support for FW fatal reporter dump") Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>