path: root/drivers/net/ethernet/mellanox/mlx4/catas.c
Age | Commit message | Author
2019-09-13 | mlx4: Split restart_one into two functions | Jiri Pirko
Split the function restart_one into two functions and separate teardown and buildup. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 | drivers: Remove explicit invocations of mmiowb() | Will Deacon
mmiowb() is now implied by spin_unlock() on architectures that require it, so there is no reason to call it from driver code.

This patch was generated using coccinelle:

    @mmiowb@
    @@
    - mmiowb();

and invoked as:

    $ for d in drivers include/linux/qed sound; do \
    spatch --include-headers --sp-file mmiowb.cocci --dir $d --in-place; done

NOTE: mmiowb() has only ever guaranteed ordering in conjunction with spin_unlock(). However, pairing each mmiowb() removal in this patch with the corresponding call to spin_unlock() is not at all trivial, so there is a small chance that this change may regress any drivers incorrectly relying on mmiowb() to order MMIO writes between CPUs using lock-free synchronisation. If you've ended up bisecting to this commit, you can reintroduce the mmiowb() calls using wmb() instead, which should restore the old behaviour on all architectures other than some esoteric ia64 systems.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-07-12 | net/mlx4_core: Add Crdump FW snapshot support | Alex Vesker
Crdump allows the driver to create a snapshot of the FW PCI crspace and health buffer during a critical FW issue. A snapshot is taken in case of a FW command timeout, FW getting stuck, or a non-zero value on the catastrophic buffer. The snapshot is exposed using devlink: the cr-space and fw-health address regions are registered on init, and snapshots are attached once a new snapshot is collected by the driver. Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-05 | mlx4: Add support for devlink reload and load driverinit values | Moshe Shemesh
Add mlx4_devlink_reload() to support devlink reload operation. Add mlx4_devlink_param_load_driverinit_values() to load values which were set using driverinit configuration mode. Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-09 | net/mlx4_core: Convert timers to use timer_setup() | Kees Cook
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Cc: Tariq Toukan <tariqt@mellanox.com> Cc: netdev@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
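The conversion described above relies on the callback receiving a pointer to the timer itself and recovering the enclosing structure from it. The following is a minimal userspace sketch of that pattern only, not the driver's code: `struct sketch_timer`, `struct sketch_poller`, `sketch_owner_of`, `sketch_timer_init`, `sketch_timer_fire` and `sketch_poll_cb` are all hypothetical names standing in for the kernel's timer structure, timer_setup() and from_timer().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel timer: the callback gets the timer. */
struct sketch_timer {
    void (*fn)(struct sketch_timer *t);
};

/* Hypothetical owner structure that embeds the timer as a member. */
struct sketch_poller {
    int polls;                  /* number of times the callback ran */
    struct sketch_timer timer;  /* embedded timer */
};

/* Recover the enclosing structure from a pointer to its member
 * (the same idea the kernel's from_timer() helper is built on). */
#define sketch_owner_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Callback: no opaque data argument, only the timer pointer. */
static void sketch_poll_cb(struct sketch_timer *t)
{
    struct sketch_poller *p = sketch_owner_of(t, struct sketch_poller, timer);

    p->polls++;
}

/* Analogue of timer_setup(): bind the callback to the timer. */
static void sketch_timer_init(struct sketch_timer *t,
                              void (*fn)(struct sketch_timer *))
{
    t->fn = fn;
}

/* Simulate expiry by invoking the callback with the timer pointer. */
static void sketch_timer_fire(struct sketch_timer *t)
{
    assert(t->fn != NULL);
    t->fn(t);
}
```

Because the owner is derived from the timer pointer, no separate data word has to be cast back into a structure pointer inside the callback.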
2017-01-30 | net/mlx4_core: Avoid command timeouts during VF driver device shutdown | Jack Morgenstein
Some Hypervisors detach VFs from VMs by instantly causing an FLR event to be generated for a VF. In the mlx4 case, this will cause that VF's comm channel to be disabled before the VM has an opportunity to invoke the VF device's "shutdown" method. The result is that the VF driver on the VM will experience a command timeout during the shutdown process when the Hypervisor does not deliver a command-completion event to the VM. To avoid FW command timeouts on the VM when the driver's shutdown method is invoked, we detect the absence of the VF's comm channel at the very start of the shutdown process. If the comm-channel has already been disabled, we cause all FW commands during the device shutdown process to immediately return success (and thus avoid all command timeouts). Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-17 | net/mlx4_core: Do not BUG_ON during reset when PCI is offline | Daniel Jurgens
The PCI channel could go offline during reset due to EEH. Don't BUG_ON in this case; the error is recoverable. Fixes: f6bc11e42646 ('net/mlx4_core: Enhance the catas flow to support device reset') Signed-off-by: Daniel Jurgens <danielj@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 | net/mlx4_core: Enable device recovery flow with SRIOV | Yishai Hadas
In SRIOV, both the PF and the VF may attempt device recovery whenever they assume that the device is not functioning. When the PF driver resets the device, the VF should detect this and attempt to reinitialize itself. The VF must be able to reset itself under all circumstances, even if the PF is not responsive.

The VF shall reset itself in the following cases:
1. Commands are not processed within reasonable time over the communication channel. The handling takes the device state into account and returns the appropriate code for each command, as in native mode; this is done in the next patch.
2. The VF driver receives an internal error event reported by the PF on the communication channel. This occurs when the PF driver resets the device or when the VF is out of sync with the PF.

Add 'VF reset' capability, which allows the VF to reinitialize itself even when the PF is not responsive.

As PF and VF may run their reset flow simultaneously, there are several cases that are handled:
- Prevent freeing VF resources upon FLR, when PF is in its unloading stage.
- Prevent PF getting VF commands before it has finished initializing its resources.
- Upon VF startup, check that comm-channel is online before sending commands to the PF and timing out.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 | net/mlx4_core: Manage interface state for Reset flow cases | Yishai Hadas
We need to manage interface state to sync between the reset flow and related flows such as remove_one. This has to be done to prevent certain races. For example, if the software stack is already down as a result of an unload call, remove_one should skip the unload phase. Implement the remove_one case; handling of AER and other cases comes next. The interface can be up or down. Upon remove_one, the state includes an extra bit indicating that the device is cleaned up, forcing other tasks to finish before the final cleanup. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
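The state handling described above can be pictured as a small bit field. This is a hedged userspace sketch under stated assumptions, not the driver's implementation: the names `SKETCH_IFACE_UP`, `SKETCH_IFACE_DELETION`, `struct sketch_iface`, `sketch_unload` and `sketch_remove_one` are hypothetical, and the mutex that would serialize these paths in a real driver is deliberately left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state bits: one for "stack is up", one set by remove_one. */
enum {
    SKETCH_IFACE_UP       = 1u << 0,
    SKETCH_IFACE_DELETION = 1u << 1
};

/* Hypothetical per-device interface state (locking omitted in this sketch). */
struct sketch_iface {
    unsigned int state;
    int unload_calls;   /* counts how often the unload phase actually ran */
};

/* Tear down the software stack, but only if it is currently up. */
static void sketch_unload(struct sketch_iface *i)
{
    assert(i != NULL);
    if (!(i->state & SKETCH_IFACE_UP))
        return;             /* already down: nothing to unload */
    i->unload_calls++;
    i->state &= ~SKETCH_IFACE_UP;
}

/* remove_one analogue: mark the device as being deleted, then unload
 * only when an earlier flow has not already brought the stack down. */
static void sketch_remove_one(struct sketch_iface *i)
{
    assert(i != NULL);
    i->state |= SKETCH_IFACE_DELETION;
    if (i->state & SKETCH_IFACE_UP)
        sketch_unload(i);
}
```

Other flows can test the deletion bit before starting new work, which is how the extra bit can force them to finish or back off ahead of the final cleanup.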
2015-01-25 | net/mlx4_core: Activate reset flow upon fatal command cases | Yishai Hadas
We activate the reset flow upon fatal command errors, when the device enters an erroneous state and must be reset. The cases below are assumed to be fatal: a FW command timed out, an error from FW on closing commands, PCI is offline when posting/pending a command. In those cases we place the device into an error state: the chip is reset, and pending commands are awakened and completed immediately. Subsequent commands will return immediately. The return code in the above cases will depend on the command. Commands which free and close resources will return success (because the chip was reset, so callers may safely free their kernel resources). Other commands will return -EIO. Since the device's state was marked as error, the catas poller will detect this and restart the device's software stack (as is done when a FW internal error is directly detected). The device state is protected by a persistent mutex that lives on its mlx4_dev; as such, the hcr_mutex is no longer needed and is removed. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
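The return-code rule described above (success for commands that free or close resources, -EIO for everything else once the device is in the error state) can be sketched as follows. This is an illustrative userspace model only: `enum sketch_cmd_kind`, `struct sketch_dev_state`, `sketch_status_in_error_state` and `sketch_cmd_result` are hypothetical names, and the real driver classifies commands by FW opcode rather than by a two-value enum.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command classes. */
enum sketch_cmd_kind {
    SKETCH_CMD_CLOSE_OR_FREE,   /* closes or frees a resource */
    SKETCH_CMD_OTHER            /* any other command */
};

/* Hypothetical persistent device state. */
struct sketch_dev_state {
    bool internal_error;        /* set once the device entered error state */
};

/* Status reported without touching hardware once the device is in error. */
static int sketch_status_in_error_state(enum sketch_cmd_kind kind)
{
    /* The chip was reset, so teardown callers may safely free resources. */
    return kind == SKETCH_CMD_CLOSE_OR_FREE ? 0 : -EIO;
}

/* Result of a command: short-circuit in error state, otherwise pass
 * through whatever status the (simulated) hardware returned. */
static int sketch_cmd_result(const struct sketch_dev_state *st,
                             enum sketch_cmd_kind kind, int hw_status)
{
    assert(st != NULL);
    if (st->internal_error)
        return sketch_status_in_error_state(kind);
    return hw_status;
}
```

Returning success for teardown commands lets callers release their kernel-side resources during recovery instead of leaking them on a spurious error.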
2015-01-25 | net/mlx4_core: Enhance the catas flow to support device reset | Yishai Hadas
This includes:
- resetting the chip when a fatal error is detected (the current code does not do this).
- exposing the ability to enter error state from outside the catas code by calling its functionality (e.g. FW command timeout, AER error).
- managing a persistent device state. This is needed to sync between reset flow cases.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 | net/mlx4_core: Refactor the catas flow to work per device | Yishai Hadas
Use a WQ per device instead of a single global WQ. This allows independent reset handling per device even when SRIOV is used. This comes as a pre-patch for supporting chip reset for both native and SRIOV. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-25 | net/mlx4_core: Maintain a persistent memory for mlx4 device | Yishai Hadas
Maintain a persistent memory that should survive a reset flow or PCI error. This comes as preparation for the coming series that supports the above flows. Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-25 | mlx4: Add support for EEH error recovery | Kleber Sacilotto de Souza
Currently the mlx4 drivers don't have the necessary callbacks to implement EEH error detection and recovery, so the PCI layer uses the probe and remove callbacks to try to recover the device after an error on the bus. However, these callbacks have race conditions with the internal catastrophic error recovery functions, which will also detect the error, and this can cause the system to crash if both EEH and catas functions try to reset the device. This patch adds the necessary error recovery callbacks and makes sure that the internal catastrophic error functions will not try to reset the device in such scenarios. It also adds some calls to pci_channel_offline() to suppress reads/writes on the bus when the slot cannot accept I/O operations, so that we prevent unnecessary accesses to the bus and speed up device removal. Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> Acked-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-13 | mlx4_core: adjust catas operation for SRIOV mode | Jack Morgenstein
When running in SRIOV mode, driver should not automatically start/stop the mlx4_core upon sensing an HCA internal error -- doing this disables/enables sriov, which will cause the hypervisor to hang if there are running VMs with attached VFs. In addition, on VMs the catas process should not run at all, since the HCA error buffer is not available to VMs in the BARs. Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-10-31 | drivers/net: Add module.h to drivers who were implicitly using it | Paul Gortmaker
The device.h header was including module.h, making it present for most of these drivers. But we want to clean that up. Call out the include of module.h in the modular network drivers. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-08-11 | mlx4: Move the Mellanox driver | Jeff Kirsher
Move the Mellanox driver into drivers/net/ethernet/mellanox/ and make the necessary Kconfig and Makefile changes. CC: Roland Dreier <roland@kernel.org> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>