net/mlx5e: Set default burst period for TX and RX reporters

System errors can sometimes cause multiple errors to be reported to the TX reporter at the same time. For instance, lost interrupts may cause several SQs to time out simultaneously. When dev_watchdog notifies the driver for that, it iterates over all SQs to trigger recovery for the timed-out ones, via TX health reporter. However, grace period allows only one recovery at a time, so only the first SQ recovers while others remain blocked. Since no further recoveries are allowed during the grace period, subsequent errors cause the reporter to enter an ERROR state, requiring manual intervention. To address this, set the TX reporter's default burst period to 0.5 second. This allows the reporter to detect and handle all timed-out SQs within this window before initiating the grace period. To account for the possibility of a similar issue in the RX reporter, its default burst period is also configured. Additionally, while here, align the TX definition prefix with the RX, as these are used only in EN driver. Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250824084354.533182-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
author: Shahar Shitrit <shshitrit@nvidia.com> 2025-08-24 11:43:54 +0300
committer: Jakub Kicinski <kuba@kernel.org> 2025-08-26 17:24:16 -0700
commit: 2d5ccb93bbb4f161ccb677ce0b2ebcfe4a089d62 (patch)
tree: d4f5f425e3f1ef240c3b0d846f7eaca8aca7a157
parent: da0e2197645c8e01bb6080c7a2b86d9a56cc64a9 (diff)
2 files changed, 6 insertions, 2 deletions
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
index 1b9ea72abc5a..eb1cace5910c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_rx.c
@@ -652,6 +652,7 @@ void mlx5e_reporter_icosq_resume_recovery(struct mlx5e_channel *c)
 }
 
 #define MLX5E_REPORTER_RX_GRACEFUL_PERIOD 500
+#define MLX5E_REPORTER_RX_BURST_PERIOD 500
 
 static const struct devlink_health_reporter_ops mlx5_rx_reporter_ops = {
 	.name = "rx",
@@ -659,6 +660,7 @@ static const struct devlink_health_reporter_ops mlx5_rx_reporter_ops = {
 	.diagnose = mlx5e_rx_reporter_diagnose,
 	.dump = mlx5e_rx_reporter_dump,
 	.default_graceful_period = MLX5E_REPORTER_RX_GRACEFUL_PERIOD,
+	.default_burst_period = MLX5E_REPORTER_RX_BURST_PERIOD,
 };
 
 void mlx5e_reporter_rx_create(struct mlx5e_priv *priv)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
index 7a4a77f6fe6a..5a4fe8403a21 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/reporter_tx.c
@@ -539,14 +539,16 @@ void mlx5e_reporter_tx_ptpsq_unhealthy(struct mlx5e_ptpsq *ptpsq)
 	mlx5e_health_report(priv, priv->tx_reporter, err_str, &err_ctx);
 }
 
-#define MLX5_REPORTER_TX_GRACEFUL_PERIOD 500
+#define MLX5E_REPORTER_TX_GRACEFUL_PERIOD 500
+#define MLX5E_REPORTER_TX_BURST_PERIOD 500
 
 static const struct devlink_health_reporter_ops mlx5_tx_reporter_ops = {
 		.name = "tx",
 		.recover = mlx5e_tx_reporter_recover,
 		.diagnose = mlx5e_tx_reporter_diagnose,
 		.dump = mlx5e_tx_reporter_dump,
-		.default_graceful_period = MLX5_REPORTER_TX_GRACEFUL_PERIOD,
+		.default_graceful_period = MLX5E_REPORTER_TX_GRACEFUL_PERIOD,
+		.default_burst_period = MLX5E_REPORTER_TX_BURST_PERIOD,
 };
 
 void mlx5e_reporter_tx_create(struct mlx5e_priv *priv)
author	Shahar Shitrit <shshitrit@nvidia.com>	2025-08-24 11:43:54 +0300
committer	Jakub Kicinski <kuba@kernel.org>	2025-08-26 17:24:16 -0700
commit	2d5ccb93bbb4f161ccb677ce0b2ebcfe4a089d62 (patch)
tree	d4f5f425e3f1ef240c3b0d846f7eaca8aca7a157
parent	da0e2197645c8e01bb6080c7a2b86d9a56cc64a9 (diff)