summaryrefslogtreecommitdiff
path: root/drivers/scsi/lpfc/lpfc_els.c
AgeCommit message (Collapse)Author
2021-09-14scsi: lpfc: Don't remove ndlp on PRLI errors in P2P modeJames Smart
In pt-2-pt mode, the initiator does not log into the target after a PRLI error. In pt-2-pt mode, the target responded to the PRLI by sending a LOGO. The LOGO causes all ELS and I/Os to be aborted. This caused the PRLI to fail. The PRLI completion path caused the discovery node to be dropped to avoid being stick in an UNUSED (not logged in) state. As the node was dropped there is no retry of the login and as it is pt-2-pt, there is no RSCN to retrigger discovery. Thus the other end is not seen by the OS. Fix by ensuring the discovery node is not dropped if connecting pt-2-pt. This will cause PLOGI to be retried. Link: https://lore.kernel.org/r/20210910233159.115896-7-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-14scsi: lpfc: Fix hang on unload due to stuck fport nodeJames Smart
A test scenario encountered an unload hang while an FLOGI ELS was in flight when a link down condition occurred. The driver fails unload as it never releases the fport node. For most nodes, when the link drops, devloss tmo is started and the timeout will cause the final node release. For the Fport, as it has not yet registered with the SCSI transport, there is no devloss timer to be started, so there is no final release. Additionally, the link down sequence causes ABORTS to be issued for pending ELS's. The completions from the ABORTS perform the release of node references. However, as the adapter is being reset to be unloaded, those completions will never occur. Fix by the following: - In the ELS cleanup, recognize when unloading and place the ELS's on a different list that immediately cleans up/completes the ELS's. It's recognized that this condition primarily affects only the fport, with other ports having normal clean up logic that handles things. - Resolve the devloss issue by, when cleaning up nodes on after link down, recognizing when the fabric node does not have a completed state (its state is UNUSED) and removing a reference so the node can delete after the ELS reference is released. Link: https://lore.kernel.org/r/20210910233159.115896-5-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-14scsi: lpfc: Fix premature rpi release for unsolicited TPLS and LS_RJTJames Smart
A test scenario has a target issuing a TPLS after accepting the driver's PRLI. TPLS is not supported by the driver so it rejects the ELS. However, the reject was only happening on the primary N_Port. If the TPLS was to a NPIV vport, not only would it reject the ELS, but it would act on the TPLS, starting devloss, then unregister from the SCSI transport and release the node. When devloss expired, it would access the node again and cause a page faul. Fix by altering the NPIV code to recognize that a correctly registered node can reject unsolicited ELS I/O and to not unregister with the SCSI transport and tear the node down. Add a check of the fc4_xpt_flags so that only a zero value allows the unreg and teardown. Link: https://lore.kernel.org/r/20210910233159.115896-4-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-14scsi: lpfc: Don't release final kref on Fport node while ABTS outstandingJames Smart
In a rarely executed path, FLOGI failure, there is a refcounting error. If FLOGI completed with an error, typically a timeout, the initial completion handler would remove the job reference. However, the job completion isn't the actual end of the job/exchange as the timeout usually initiates an ABTS, and upon that ABTS completion, a final completion is sent. The driver removes the reference again in the final completion. Thus the imbalance. In the buggy cases, if there was a link bounce while the delayed response is outstanding, the fport node may be referenced again but there was no additional reference as it is already present. The delayed completion then occurs and removes the last reference freeing the node and causing issues in the link up processed that is using the node. Fix this scenario by removing the snippet that removed the reference in the initial FLOGI completion. The bad snippet was poorly trying to identify the FLOGI as OK to do so by realizing the node was not registered with either SCSI or NVMe transport. Link: https://lore.kernel.org/r/20210910233159.115896-3-jsmart2021@gmail.com Fixes: 618e2ee146d4 ("scsi: lpfc: Fix FLOGI failure due to accessing a freed node") Cc: <stable@vger.kernel.org> # v5.13+ Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-09-13scsi: lpfc: Fix CPU to/from endian warnings introduced by ELS processingJames Smart
The kernel test robot reported the following sparse warning: ".../lpfc_els.c:3984:25: sparse: sparse: cast from restricted __be16" For the error being flagged, using be32_to_cpu() on a be16 data type, it was simple enough. But a review of other elements and warnings were also evaluated. This patch corrected several items in the original patch: - Using be32_to_cpu() on a be16 data type - cpu_to_le32() used on a std uint32_t (CPU) data type. Note: This is a byte array, but stored in LE layout by hardware at 32-bit boundaries. So it possibly needed conversion. - Using cpu_to_le32() on a std uint16_t and assigned to a char typeA - Using le32_to_cpu() on a le16 type - Missing cpu_to_le16() on an assignment Link: https://lore.kernel.org/r/20210830231243.6227-1-jsmart2021@gmail.com Fixes: 9064aeb2df8e ("scsi: lpfc: Add EDC ELS support") Reported-by: kernel test robot <lkp@intel.com> Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-24scsi: lpfc: Add cmf_info sysfs entryJames Smart
Allow abbreviated cm framework status information to be obtained via sysfs. Link: https://lore.kernel.org/r/20210816162901.121235-14-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-24scsi: lpfc: Add support for maintaining the cm statistics bufferJames Smart
Add the logic to move the congestion management and event information into the cmd statistics buffer maintained for the adapter. The update includes rolling up values for the last minute, hour, and day information. Link: https://lore.kernel.org/r/20210816162901.121235-12-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-24scsi: lpfc: Add support for the CM frameworkJames Smart
Complete the enablement of the cm framework feature in the adapter. Perform the following: - Detect the presence of the congestion management framework feature. When the cm framework is present: - Issue the SET_FEATURE command to enable the feature. - Register the cm statistics buffer with the adapter. - Read the cm enablement buffer to determine the cm framework state for cm management. When cm management is enabled: - Monitor all FPIN and congestion signalling events, incrementing counters. - Regularly sync with the adapter to communicate congestion events and to receive an rx request limit. - Monitor requests for rx data and ensure that no more than the adapter prescribed limit is issued on the link. If the limit is exceeded, SCSI and/or NVMe traffic is temporarily suspended. - Maintain the minute, hourly, daily statistics buffer. - Monitor for congestion enablement change events, causing a reread of the enablement buffer and acting on any change in enablement. And: - Add teardown logic, including buffer deregistration, on adapter detachment or reset. Link: https://lore.kernel.org/r/20210816162901.121235-10-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-24scsi: lpfc: Add EDC ELS supportJames Smart
When congestion management is enabled, issue EDC ELS to register congestion signaling capabilities with the fabric. The response handling will process the fabric parameters and set the reporting parameters. Similarly, add support for receiving an EDC request from the fabric generating a corresponding response. Implement handlers for congestion signals from the fabric and maintain statistics for them. Link: https://lore.kernel.org/r/20210816162901.121235-6-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-08-24scsi: lpfc: Expand FPIN and RDF receive loggingJames Smart
Expand FPIN logging: - Display Attached Port Names for Link Integrity and Peer Congestion events - Log Delivery, Peer Congestion, and Congestion events - Sanity check FPIN descriptor lengths when processing FPIN descriptors. Log RDF events when congestion logging is enabled. Link: https://lore.kernel.org/r/20210816162901.121235-5-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-27scsi: lpfc: Add 256 Gb link speed supportJames Smart
Update routines to support 256 Gb link speed for LPe37000/LPe38000 adapters. 256 Gb speeds can be seen on trunk links. Link: https://lore.kernel.org/r/20210722221721.74388-5-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-18scsi: lpfc: Skip reg_vpi when link is down for SLI3 in ADISC cmpl pathJames Smart
During RSCN storms, some instances of LIP on SLI-3 adapters lead to a situation where FLOGIs keep failing with firmware indicating an illegal command error code. This situation was preceded by an ADISC completion that was processed while the link was down. This path on SLI-3 performs a CLEAR_LA and attempts to activate a VPI with REG_VPI. Later, as the FLOGI completes, it's no longer in sync with the VPI state. In SLI-3 it is illegal to have an active VPI during FLOGI. Resolve by circumventing the SLI-3 path that performs the CLEAR_LA and REG_VPI. The path will be taken after the FLOGI after the next Link Up. Link: https://lore.kernel.org/r/20210707184351.67872-18-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-18scsi: lpfc: Call discovery state machine when handling PLOGI/ADISC completionsJames Smart
In the PLOGI and ADISC completion handling, the device removal event could be skipped during some link errors. This could leave a stale node in UNUSED state. Driver unload would hang for a long time waiting for this node to be freed. Resolve by taking the following steps: - Always post ADISC completion events to discovery state machine upon ADISC completion. - In case of a completion error for PLOGI/ADISC, ensure that init refcount is dropped if not registered with transport. Link: https://lore.kernel.org/r/20210707184351.67872-17-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-18scsi: lpfc: Delay unregistering from transport until GIDFT or ADISC completesJames Smart
On an RSCN event, the nodes specified in RSCN payload and in MAPPED state are moved to NPR state in order to revalidate the login. This triggers an immediate unregister from SCSI/NVMe backend. The assumption is that the node may be missing. The re-registration with the backend happens after either relogin (PLOGI/PRLI; if ADISC is disabled or login truly lost) or when ADISC completes successfully (rediscover with ADISC enabled). However, the NVMe-FC standard provides for an RSCN to be triggered when the remote port supports a discovery controller and there was a change of discovery log content. As the remote port typically also supports storage subsystems, this unregister causes all storage controller connections to fail and require reconnect. Correct by reworking the code to ensure that the unregistration only occurs when a login state is truly terminated, thereby leaving the NVMe storage controllers in place. The changes made are: - Retain node state in ADISC_ISSUE when scheduling ADISC ELS retry. - Do not clear wwpn/wwnn values upon ADISC failure. - Move MAPPED nodes to NPR during RSCN processing, but do not unregister with transport. On GIDFT completion, identify missing nodes (not marked NLP_NPR_2B_DISC) and unregister them. - Perform unregistration for nodes that will go through ADISC processing if ADISC completion fails. - Successful ADISC completion will move node back to MAPPED state. Link: https://lore.kernel.org/r/20210707184351.67872-16-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-18scsi: lpfc: Remove REG_LOGIN check requirement to issue an ELS RDFJames Smart
Since the REG_LOGIN to the fabric controller happens in parallel with SCR, it may or may not be completed by the time RDF is sent. RDF and SCR are sent to the fabric in parallel, so checking for a completed REG_LOGIN in the RDF submit path is not needed. Remove the REG_LOGI check from the RDF submission path. Link: https://lore.kernel.org/r/20210707184351.67872-11-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-18scsi: lpfc: Fix memory leaks in error paths while issuing ELS RDF/SCR requestJames Smart
The ELS job request structure, that is allocated while issuing ELS RDF/SCR request path, is not being released in an error path causing a memory leak message on driver unload. Free the ELS job structure in the error paths. Link: https://lore.kernel.org/r/20210707184351.67872-10-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-18scsi: lpfc: Fix NULL ptr dereference with NPIV ports for RDF handlingJames Smart
RDF ELS handling for NPIV ports may result in an incorrect NDLP reference count. In the event of a persistent link down, this may lead to premature release of an NDLP structure and subsequent NULL ptr dereference panic. Remove extraneous lpfc_nlp_put() call in NPIV port RDF processing. Link: https://lore.kernel.org/r/20210707184351.67872-9-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-07-18scsi: lpfc: Discovery state machine fixes for LOGO handlingJames Smart
If a LOGO is received for a node that is in the NPR state, post a DEVICE_RM event to allow clean up of the logged out node. Clearing the NLP_NPR_2B_DISC flag upon receipt of a LOGO ACC may cause skipping of processing outstanding PLOGIs triggered by parallel RSCN events. If an outstanding PLOGI is being retried and receipt of a LOGO ACC occurs, then allow the discovery state machine's PLOGI completion to clean up the node. Link: https://lore.kernel.org/r/20210707184351.67872-6-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-06-10scsi: lpfc: vmid: Implement ELS commands for appidGaurav Srivastava
Implement ELS commands QFPA and UVEM for the priority tagging appid support. Link: https://lore.kernel.org/r/20210608043556.274139-8-muneendra.kumar@broadcom.com Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Gaurav Srivastava <gaurav.srivastava@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Muneendra Kumar <muneendra.kumar@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-05-21scsi: lpfc: Reregister FPIN types if ELS_RDF is received from fabric controllerJames Smart
FC-LS-5 specifies that a received RDF implies a possible change to fabric supported diagnostic functions. Endpoints are to re-perform the RDF exchange with the fabric to enable possible new features or adapt to changes in values. This patch adds the logic to RDF receive to re-perform the RDF exchange with the switch. Link: https://lore.kernel.org/r/20210514195559.119853-11-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-05-21scsi: lpfc: Fix node handling for Fabric Controller and Domain ControllerJames Smart
During link bounce testing, RPI counts were seen to differ from the number of nodes. For fabric and domain controllers, a temporary RPI is assigned, but the code isn't registering it. If the nodes do go away, such as on link down, the temporary RPI isn't being released. Change the way these two fabric services are managed, make them behave like any other remote port. Register the RPI and register with the transport. Never leave the nodes in a NPR or UNUSED state where their RPI is in limbo. This allows them to follow normal dev_loss_tmo handling, RPI refcounting, and normal removal rules. It also allows fabric I/Os to use the RPI for traffic requests. Note: There is some logic that still has a couple of exceptions when the Domain controller (0xfffcXX). There are cases where the fabric won't have a valid login but will send RDP. Other times, it will it send a LOGO then an RDP. It makes for ad-hoc behavior to manage the node. Exceptions are documented in the code. Link: https://lore.kernel.org/r/20210514195559.119853-7-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-05-21scsi: lpfc: Fix Node recovery when driver is handling simultaneous PLOGIsJames Smart
When lpfc is handling a solicited and unsolicited PLOGI with another initiator, the remote initiator is never recovered. The node for the initiator is erroneouosly removed and all resources released. In lpfc_cmpl_els_plogi(), when lpfc_els_retry() returns a failure code, the driver is calling the state machine with a device remove event because the remote port is not currently registered with the SCSI or NVMe transports. The issue is that on a PLOGI "collision" the driver correctly aborts the solicited PLOGI and allows the unsolicited PLOGI to complete the process, but this process is interrupted with a device_rm event. Introduce logic in the PLOGI completion to capture the PLOGI collision event and jump out of the routine. This will avoid removal of the node. If there is no collision, the normal node removal will occur. Fixes: 52edb2caf675 ("scsi: lpfc: Remove ndlp when a PLOGI/ADISC/PRLI/REG_RPI ultimately fails") Cc: <stable@vger.kernel.org> # v5.11+ Link: https://lore.kernel.org/r/20210514195559.119853-6-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-05-21scsi: lpfc: Fix "Unexpected timeout" error in direct attach topologyJames Smart
An 'unexpected timeout' message may be seen in a point-2-point topology. The message occurs when a PLOGI is received before the driver is notified of FLOGI completion. The FLOGI completion failure causes discovery to be triggered for a second time. The discovery timer is restarted but no new discovery activity is initiated, thus the timeout message eventually appears. In point-2-point, when discovery has progressed before the FLOGI completion is processed, it is not a failure. Add code to FLOGI completion to detect that discovery has progressed and exit the FLOGI handling (noop'ing it). Link: https://lore.kernel.org/r/20210514195559.119853-4-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-05-21scsi: lpfc: Fix unreleased RPIs when NPIV ports are createdJames Smart
While testing NPIV and watching logins and used RPI levels, it was seen the used RPI count was much higher than the number of remote ports discovered. Code inspection showed that remote port removals on any NPIV instance are releasing the RPI, but not performing an UNREG_RPI with the adapter thus the reference counting never fully drops and the RPI is never fully released. This was happening on NPIV nodes due to a log of fabric ELS's to fabric addresses. This lack of UNREG_RPI was introduced by a prior node rework patch that performed the UNREG_RPI as part of node cleanup. To resolve the issue, do the following: - Restore the RPI release code, but move the location to so that it is in line with the new node cleanup design. - NPIV ports now release the RPI and drop the node when the caller sets the NLP_RELEASE_RPI flag. - Set the NLP_RELEASE_RPI flag in node cleanup which will trigger a release of RPI to free pool. - Ensure there's an UNREG_RPI at LOGO completion so that RPI release is completed. - Stop offline_prep from skipping nodes that are UNUSED. The RPI may not have been released. - Stop the default RPI handling in lpfc_cmpl_els_rsp() for SLI4. - Fixed up debugfs RPI displays for better debugging. Fixes: a70e63eee1c1 ("scsi: lpfc: Fix NPIV Fabric Node reference counting") Link: https://lore.kernel.org/r/20210514195559.119853-2-jsmart2021@gmail.com Cc: <stable@vger.kernel.org> # v5.11+ Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-04-13scsi: lpfc: Fix various trivial errors in comments and log messagesJames Smart
Clean up minor issues spotted by tools and code review: - Spelling Errors - Spurious characters and errors in function headers - nvme_info wqerr and err fields source data reversed - Extraneous new line in log message 0466 - Spacing error in log message 0109 - Messages 0140 and 0141 have portname and nodename reversed - Incorrect function labelling in comment Link: https://lore.kernel.org/r/20210412013127.2387-13-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-04-13scsi: lpfc: Fix use-after-free on unused nodes after port swapJames Smart
During target port swap, the swap logic ignores the DROPPED flag in the nodes. As a node then moves into the UNUSED state, the reference count will be dropped. If a node is later reused and moved out of the UNUSED state, an access can result in a use-after-free assert. Fix by having the port swap logic propagate the DROPPED flag when switching nodes. This will avoid reference from being dropped. Link: https://lore.kernel.org/r/20210412013127.2387-8-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-04-13scsi: lpfc: Fix lack of device removal on port swaps with PRLIsJames Smart
During target port-swap testing with link flips, the initiator could encounter PRLI errors. If the target node disappears permanently, the ndlp is found stuck in UNUSED state with ref count of 1. The rmmod of the driver will hang waiting for this node to be freed. While handling a link error in PRLI completion path, the code intends to skip triggering the discovery state machine. However this is causing the final reference release path to be skipped. This causes the node to be stuck with ref count of 1 Fix by ensuring the code path triggers the device removal event on the node state machine. Link: https://lore.kernel.org/r/20210412013127.2387-6-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-04-13scsi: lpfc: Fix NMI crash during rmmod due to circular hbalock dependencyJames Smart
Remove hbalock dependency for lpfc_abts_els_sgl_list and lpfc_abts_nvmet_ctx_list. The lists are adaquately synchronized with the sgl_list_lock and abts_nvmet_buf_list_lock. Link: https://lore.kernel.org/r/20210412013127.2387-5-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-04-13scsi: lpfc: Fix reference counting errors in lpfc_cmpl_els_rsp()James Smart
Call traces are being seen that result from a nodelist structure ref counting error. They are typically seen after transmission of an LS_RJT ELS response. Aged code in lpfc_cmpl_els_rsp() calls lpfc_nlp_not_used() which, if the ndlp reference count is exactly 1, will decrement the reference count. Previously lpfc_nlp_put() was within lpfc_els_free_iocb(), and the 'put' within the free would only be invoked if cmdiocb->context1 was not NULL. Since the nodelist structure reference count is decremented when exiting lpfc_cmpl_els_rsp() the lpfc_nlp_not_used() calls are no longer required. Calling them is causing the reference count issue. Fix by removing the lpfc_nlp_not_used() calls. Link: https://lore.kernel.org/r/20210412013127.2387-4-jsmart2021@gmail.com Co-developed-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: Justin Tee <justin.tee@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-24scsi: lpfc: Fix a typoBhaskar Chowdhury
s/conditons/conditions/ Link: https://lore.kernel.org/r/20210324064829.32092-1-unixbhaskar@gmail.com Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Correct function header comments related to ndlp reference countingJames Smart
Code inspection revealed stale comments in function headers for functions that call lpfc_prep_els_iocb(). Changes in ndlp reference counting were not reflected in function headers. Update the stale comments in function headers to more accurately indicate ndlp reference counting. Link: https://lore.kernel.org/r/20210301171821.3427-21-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Fix ADISC handling that never frees nodesJames Smart
While testing target port swap test with ADISC enabled, several nodes remain in UNUSED state. These nodes are never freed and rmmod hangs for long time before finising with "0233 Nodelist not empty" error. During PLOGI completion lpfc_plogi_confirm_nport() looks for existing nodes with same WWPN. If found, the existing node is used to continue discovery. The node on which plogi was performed is freed. When ADISC is enabled, an ADISC els request is triggered in response to an RSCN. It's possible that the ADISC may be rejected by the remote port causing the ADISC completion handler to clear the port and node name in the node. If this occurs, if a PLOGI is received it causes a node lookup based on wwpn to now fail, causing the port swap logic to kick in which allocates a new node and swaps to it. This effectively orphans the original node structure. Fix the situation by detecting when the lookup fails and forgo the node swap and node allocation by using the node on which the PLOGI was issued. Link: https://lore.kernel.org/r/20210301171821.3427-15-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Fix dropped FLOGI during pt2pt discovery recoveryJames Smart
When connected in pt2pt mode, there is a scenario where the remote port significantly delays sending a response to our FLOGI, but acts on the FLOGI it sent us and proceeds to PLOGI/PRLI. The FLOGI ends up timing out and kicks off recovery logic. End result is a lot of unnecessary state changes and lots of discovery messages being logged. Fix by terminating the FLOGI and noop'ing its completion if we have already accepted the remote ports FLOGI and are now processing PLOGI. Link: https://lore.kernel.org/r/20210301171821.3427-13-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Fix status returned in lpfc_els_retry() error exit pathJames Smart
An unlikely error exit path from lpfc_els_retry() returns incorrect status to a caller, erroneously indicating that a retry has been successfully issued or scheduled. Change error exit path to indicate no retry. Link: https://lore.kernel.org/r/20210301171821.3427-12-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Fix use after free in lpfc_els_free_iocbJames Smart
There are several code paths where the following sequence occurs: - An ndlp pointer is assigned to an iocb via a nlp_get() - An attempt is made to issue the iocb, but it fails - The failure case does a put on the ndlp then calls lpfc_els_free_iocb() The put may free the ndlp structure, but the els_free_iocb may reference the now-stale ndlp pointer and cause a crash. Fix by ensuring that the lpfc_els_free_iocb() occurs before the lpfc_nlp_put(). While fixing, refactor the code to better ensure this calling sequence. Link: https://lore.kernel.org/r/20210301171821.3427-11-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Fix null pointer dereference in lpfc_prep_els_iocb()James Smart
It is possible to call lpfc_issue_els_plogi() passing a did for which no matching ndlp is found. A call is then made to lpfc_prep_els_iocb() with a null pointer to a lpfc_nodelist structure resulting in a null pointer dereference. Fix by returning an error status if no valid ndlp is found. Fix up comments regarding ndlp reference counting. Link: https://lore.kernel.org/r/20210301171821.3427-10-jsmart2021@gmail.com Fixes: 4430f7fd09ec ("scsi: lpfc: Rework locations of ndlp reference taking") Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Fix lpfc_els_retry() possible null pointer dereferenceJames Smart
Driver crashed in lpfc_debugfs_disc_trc() due to null ndlp pointer. In some calling cases, the ndlp is null and the did is looked up. Fix by using the local did variable that is set appropriately based on ndlp value. Link: https://lore.kernel.org/r/20210301171821.3427-7-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Fix FLOGI failure due to accessing a freed nodeJames Smart
After an initial successful FLOGI into the switch, if a subsequent FLOGI fails the driver crashed accessing a node struct. On FLOGI error, the flogi completion logic triggers the final dereference on the node structure without checking if it is registered with a backend. The devloss logic is triggered after node is freed leading to the access of freed node. Fix by adjusting the error path to not take the final dereferece if there is an outstanding transport registration. Let the transport devloss call remove the final reference. Link: https://lore.kernel.org/r/20210301171821.3427-6-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-03-04scsi: lpfc: Fix stale node accesses on stale RRQ requestJames Smart
Whenever an RRQ needs to be triggered, the DID from the node structure and node pointer are stored in the RRQ data structure and the RRQ is scheduled for later transmission. However, at the point in time that the timer triggers, there's no validation on the node pointer. Reference counters may have freed the structure. Additionally the DID in the node may no longer be valid. Fix by not tracking the node pointer in the RRQ, only the DID. At the time of the timer expiration, look up the node with the did and if present, send the RRQ. If no node exists, no need to send the RRQ. Link: https://lore.kernel.org/r/20210301171821.3427-5-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-01-29scsi: lpfc: Fix 'physical' typosBjorn Helgaas
Fix misspellings of "physical". Link: https://lore.kernel.org/r/20210126211248.2920028-1-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-01-07scsi: lpfc: Implement health checking when aborting I/OJames Smart
Several errors have occurred where the adapter stops or fails but does not raise the register values for the driver to detect failure. Thus driver is unaware of the failure. The failure typically results in I/O timeouts, the I/O timeout handler failing (after several seconds), and the error handler escalating recovery policy and resulting in more errors. Eventually, the driver is in a position where things have spiraled and it can't do recovery because other recovery ops are still outstanding and it becomes unusable. Resolve the situation by having the I/O timeout handler (actually a els, SCSI I/O, NVMe ls, or NVMe I/O timeout), in addition to aborting the I/O, perform a mailbox command and look for a response from the hardware. If the mailbox command fails, it will mark the adapter offline and then invoke the adapter reset handler to clean up. The new I/O timeout test will be limited to a test every 5s. If there are multiple I/O timeouts concurrently, only the 1st I/O timeout will generate the mailbox command. Further testing will only occur once a timeout occurs after a 5s delay from the last mailbox command has expired. Link: https://lore.kernel.org/r/20210104180240.46824-14-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-01-07scsi: lpfc: Fix target reset failingJames Smart
Target reset is failed by the target as an invalid command. The Target Reset TMF has been obsoleted in T10 for a while, but continues to be used. On (newer) devices, the TMF is rejected causing the reset handler to escalate to adapter resets. Fix by having Target Reset TMF rejections be translated into a LOGO and re-PLOGI with the target device. This provides the same semantic action (although, if the device also supports nvme traffic, it will terminate nvme traffic as well - but it's still recoverable). Link: https://lore.kernel.org/r/20210104180240.46824-10-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2021-01-07scsi: lpfc: Fix PLOGI S_ID of 0 on pt2pt configJames Smart
Under some pt2pt situations, the other end of the link may issue a LOGO after successfully completing PLOGI and assigning addresses to the port. Thus the driver may attempt a new PLOGI to re-create the login, but the LOGO handling cleared the address back to 0. Once this happens, the other end, which may be address 0, gets all confused and this cannot be resolved without an administrative action to bounce the link. Fix by assuming that address assignment only occurs on the 1st PLOGI after link up, and regardless of login state, the address assignment sticks. The FC standards aren't particularly clear in this situation (it only describes initial PLOGI), but there is nothing that contradicts this and behaviors on the devices tested appears to conform to the understanding. Thus, don't reset the port address to 0 as part of LOGO handling. Port addresses will only reset on link down. Link: https://lore.kernel.org/r/20210104180240.46824-2-jsmart2021@gmail.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-12-01scsi: lpfc: Correct null ndlp reference on routine exitJames Smart
smatch correctly called out a logic error with accessing a pointer after checking it for null: drivers/scsi/lpfc/lpfc_els.c:2043 lpfc_cmpl_els_plogi() error: we previously assumed 'ndlp' could be null (see line 1942) Adjust the exit point to avoid the trace printf ndlp reference. A trace entry was already generated when the ndlp was checked for null. Link: https://lore.kernel.org/r/20201130181226.16675-1-james.smart@broadcom.com Fixes: 4430f7fd09ec ("scsi: lpfc: Rework locations of ndlp reference taking") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-11-19scsi: lpfc: Fix set but not used warnings from Rework remote port lock handlingJames Smart
Remove local variables that are set but not used. Link: https://lore.kernel.org/r/20201119203340.121819-1-james.smart@broadcom.com Fixes: c6adba150191 ("scsi: lpfc: Rework remote port lock handling") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-11-19scsi: lpfc: Fix memory leak on lcb_contextColin Ian King
Currently there is an error return path that neglects to free the allocation for lcb_context. Fix this by adding a new error free exit path that kfree's lcb_context before returning. Use this new kfree exit path in another exit error path that also kfree's the same object, allowing a line of code to be removed. Link: https://lore.kernel.org/r/20201118141314.462471-1-colin.king@canonical.com Fixes: 4430f7fd09ec ("scsi: lpfc: Rework locations of ndlp reference taking") Reviewed-by: James Smart <james.smart@broadcom.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Addresses-Coverity: ("Resource leak")
2020-11-17scsi: lpfc: Convert abort handling to SLI-3 and SLI-4 handlersJames Smart
This patch reworks the abort interfaces such that SLI-3 retains the iocb-based formatting and completions and SLI-4 now uses native WQEs and completion routines. The following changes are made: - The code is refactored from a confusing 2 routine sequence of xx_abort_iotag_issue(), which creates/formats and abort cmd, and xx_issue_abort_tag(), which then issues and handles the completion of the abort cmd - into a single interface of xx_issue_abort_iotag(). The new interface will determine whether SLI-3 or SLI-4 and then call the appropriate handler. A completion handler can now be specified to address the differences in completion handling. Note: original code is all iocb based, with SLI-4 converting to SLI-3 for the SCSI/ELS path, and NVMe natively using wqes. - The SLI-3 side is refactored: The older iocb-base lpfc_sli_issue_abort_iotag() routine is combined with the logic of lpfc_sli_abort_iotag_issue() as well as the iocb-specific code in lpfc_abort_handler() and lpfc_sli_abort_iocb() to create the new single SLI-3 abort routine that formats and issues the iocb. - The SLI-4 side is refactored and added to: The native WQE abort code in NVMe is moved to the new SLI-4 issue_abort_iotag() routine. Items in SCSI that set fields not set by NVMe is migrated into the new routine. Thus the routine supports NVMe and SCSI initiators. The nvmet block (target) formats the abort slightly different (like the old NVMe initiator) thus it has its own prep routine stolen from NVMe initiator and it retains the current code it has for issuing the WQE (does not use the commonized routine the initiators do). SLI-4 completion handlers were also added. - lpfc_abort_handler now becomes a wrapper that determines whether SLI-3 or SLI-4 and calls the proper abort handler. Link: https://lore.kernel.org/r/20201115192646.12977-16-james.smart@broadcom.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-11-17scsi: lpfc: Fix NPIV Fabric Node reference countingJames Smart
While testing initiator-side cable swaps with NPIV, oops occur. The reference counts for the Fabric nodes on the NPIV vports isn't balanced, resulting in premature node removal. The following fixes were made: - Removed the FC_LBIT check in lpfc_linkup_port. This removed the special case for vports that didn't have them clean up just like the physical port. - Removed the unreg_rpi call in lpfc_cleanup_node. In this section, the node is being removed in the context of a reference count release and a mailbox command can't be issued at this point. - Remove special case handling in the default mailbox completion handler that allowed the skipping of a node reference. Now, reference counting always requires the removal of the reference. - Move the location of the DEVICE_RM event is done during LOGO handling as the driver has additional work to do on the ndlp before puts/releases can be performed. Link: https://lore.kernel.org/r/20201115192646.12977-10-james.smart@broadcom.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-11-17scsi: lpfc: Fix NPIV discovery and Fabric Node detectionJames Smart
While testing NPIV and link bounces, the vport would not show a fabric node for the F_Port, would not transition into NPR state during a link fault, or leave the FDMI node untouched during error injection. Cause for this was determined to be an inconsistent manner in which F_Port, Nameserver, and FDMI controller nodes were created and linked. In some cases, the nodes would never be unregistered from the transport, leaving references active. In other cases, the fabric nodes may register with the transport multiple times while still registered. The following changes were made: - Fix the FDISC issue routine, which starts vport (re)creation, to mark the F_Port as a fabric node (NLP_FABRIC) and allow the F_Port node to fully be created and show up in the node list. - When remote ports are cleaned up on vport termination, cleanup the nameserver and FDMI controller nodes on the vport so they unregister from the transport. - On link bounces, don't exclude the NPIV Fabric remote ports from transitioning to the NPR state, allowing them to avoid re-registration if already registered. Link: https://lore.kernel.org/r/20201115192646.12977-9-james.smart@broadcom.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2020-11-17scsi: lpfc: Unsolicited ELS leaves node in incorrect state while dropping itJames Smart
When a target swap happens, under certain conditions the node sends a LOGO. The unsolicited ELS logic responds with a reject. The logic may allocate a new node to handle this. Afterward, the new nodes are dropped incorrectly leaving them in a mis-matched state and refcounting causes a use-after-free situation leading to a crash. It is also possible that the unsolicited els handling finds a node which is in an UNUSED state. The handling moves these nodes to NPR state with a refcount of 1. Although the end of the discovery logic assumes a final put will free such a node, there are codes paths which could increment the reference count, thus the node is in NPR state and not released. Eventually this mismatch in state and refcount leads to premature release of the node causing a crash. Fix by always using the discovery engine DEVICE RM event to decrement and release the nodes (rather than explicit code that tried to do it before). This will take care of moving the node to the UNUSED state and then removes the final ref count. If there is a trigger to reuse this node, the transition from the UNUSED state clearly indicates that the initial reference is then incremented and use can continue. Link: https://lore.kernel.org/r/20201115192646.12977-8-james.smart@broadcom.com Co-developed-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>