path: root/drivers/misc/habanalabs
Age | Commit message | Author
2021-09-29 | habanalabs: fix resetting args in wait for CS IOCTL | Rajaravi Krishna Katta
In the wait for CS IOCTL code, the driver resets the incoming args structure before returning to the user, regardless of the return value of the IOCTL. If the IOCTL returns EINTR, resetting the args causes an error when userspace immediately repeats the ioctl call (which is the behavior of the hl-thunk userspace library). The solution is to reset the args only if the driver returns success (0) as the return value of the IOCTL. Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
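The reset-only-on-success rule can be illustrated with a small userspace sketch. This is not the driver's code: `struct wait_args`, its fields, and `finish_wait()` are hypothetical stand-ins, since the commit message does not show the real argument layout.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for the wait-for-CS argument block; the real
 * uAPI layout is not shown in the commit message. */
struct wait_args {
    unsigned long long seq;        /* input: sequence number to wait on */
    unsigned long long timeout_us; /* input: timeout in microseconds */
    unsigned int status;           /* output: written only on success */
};

/* Finish the call: wipe the inputs and write the output only when rc is 0,
 * so a caller that receives -EINTR can reissue the very same struct. */
int finish_wait(struct wait_args *args, int rc, unsigned int status)
{
    if (rc == 0) {
        memset(args, 0, sizeof(*args));
        args->status = status;
    }
    return rc;
}
```

With this shape, a retry loop in userspace that repeats the call on EINTR still finds its original seq and timeout_us in place.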
2021-09-14 | habanalabs: expose a single cs seq in staged submissions | Ofir Bitton
Staged submission consists of multiple command submissions. In order to be explicit, driver should return a single cs sequence for every cs in the submission, or else user may try to wait on an internal CS rather than waiting for the whole submission. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-14 | habanalabs: fix wait offset handling | farah kassabri
Add handling for the case where the user doesn't set the wait offset and leaves it as 0. In such a case the driver decrements this zero value by one, which causes the code to wait for the wrong number of signals. The solution is to treat this case as in legacy wait cs, and wait for the next signal. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
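A minimal sketch of the underflow and its guard, under loud assumptions: the commit message does not give the field's type or exact semantics, so this assumes an unsigned 32-bit, 1-based offset, and `signals_to_skip()` is a hypothetical helper rather than a driver function.

```c
#include <stdint.h>

/* Hypothetical helper: how many signals to skip before the one to wait on,
 * given a user-supplied 1-based wait offset. An offset of 0 means the user
 * left the field unset. */
uint32_t signals_to_skip(uint32_t wait_offset)
{
    if (wait_offset == 0)
        return 0;           /* assumed legacy behavior: wait for the next signal */
    return wait_offset - 1; /* safe here, because wait_offset >= 1 */
}
```

Without the guard, `0 - 1` on an unsigned 32-bit value wraps to `UINT32_MAX`, which is one way a zero offset could turn into a wait for the wrong number of signals.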
2021-09-14 | habanalabs: rate limit multi CS completion errors | Ofir Bitton
Because the user can send wrong arguments to the multi CS API, we rate limit the number of errors dumped to dmesg. In addition, we change the severity to warning. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
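The idea of rate limiting log output can be sketched in userspace as a fixed window with a burst allowance. The commit message does not say which kernel helper the driver uses, so `struct ratelimit` and `ratelimit_allow()` below are illustrative names only and model the general technique, not the driver's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative fixed-window limiter: at most `burst` messages per window. */
struct ratelimit {
    uint64_t interval; /* window length, in caller-defined time units */
    unsigned burst;    /* messages allowed per window */
    uint64_t begin;    /* start time of the current window */
    unsigned printed;  /* messages already emitted in this window */
};

/* Returns true if a message may be emitted at time `now`. */
bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
    if (now - rl->begin >= rl->interval) {
        /* The window expired: start a new one. */
        rl->begin = now;
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return false; /* suppress: the budget for this window is spent */
    rl->printed++;
    return true;
}
```

A caller would print its warning only when `ratelimit_allow()` returns true, so a user who keeps passing bad arguments cannot flood the log.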
2021-09-14 | habanalabs/gaudi: fix LBW RR configuration | Oded Gabbay
Couple of fixes to the LBW RR configuration:
1. Add missing configuration of the SM RR registers in the DMA_IF.
2. Remove HBW range that doesn't belong.
3. Add entire gap + DBG area, from end of TPC7 to end of entire DBG space.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-14 | habanalabs: Fix spelling mistake "FEADBACK" -> "FEEDBACK" | Colin Ian King
There is a spelling mistake in a literal string. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-14 | habanalabs: fail collective wait when not supported | Ofir Bitton
As collective wait operation is required only when NIC ports are available, we disable the option to submit a CS in case all the ports are disabled, which is the current situation in the upstream driver. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-14 | habanalabs/gaudi: use direct MSI in single mode | Omer Shpigelman
Due to the FLR scenario when running inside a VM, we must not use indirect MSI because it might cause some issues on VM destroy. In a VM we use single MSI mode, in contrast to the multi MSI mode that is used on bare-metal. Hence direct MSI should be used in single MSI mode only. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-14 | habanalabs: fix kernel OOPs related to staged cs | farah kassabri
In the case of a single staged cs with both first/last indications set, we reach a scenario where, in the cs_release function flow, we don't cancel the TDR work before freeing the cs memory. This leads to a kernel OOPs, since by the time the timer expires the work pointer has already been freed. In addition, treat a wait encaps cs "not found" handle as "OK" for the user, in order to keep the user interface the same for both the legacy and encaps signal/wait features. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-14 | habanalabs: fix potential race in interrupt wait ioctl | Ofir Bitton
We have a potential race where a user interrupt can be received after the user thread's value comparison but before the request is added to the wait list. This means that if no subsequent interrupt is received, the user thread will time out and fail. The solution is to add the request to the wait list before we perform the comparison. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: hwmon default card name | Rajaravi Krishna Katta
This commit corrects the CARD NAME for Gaudi to "HL205". Signed-off-by: Rajaravi Krishna Katta <rkatta@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: add support for f/w reset | Oded Gabbay
When the f/w runs in secured mode, it can reset the ASIC when certain events occur. In unsecured mode, the driver asks the f/w to reset the ASIC for those events. We need to perform the entire reset procedure but without accessing the ASIC, i.e. without halting the engines and without sending messages to the f/w. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: block ICACHE_BASE_ADDERESS_HIGH in TPC | Oded Gabbay
This register shouldn't be modified by user. Prefetch is disabled in Gaudi. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: cannot sleep while holding spinlock | farah kassabri
Fix 2 areas in the code where it's possible the code will go to sleep while holding a spinlock. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: never copy_from_user inside spinlock | Oded Gabbay
copy_from_user might sleep, so we can never call it while holding a spinlock. Moreover, the lock is not necessary when waiting for a user interrupt, because if multiple threads call this function on the same interrupt, each one will have its own fence object inside the driver. The user address might be the same, but it doesn't really matter to us, as we only read from it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: remove unnecessary device status check | Oded Gabbay
Checking if the device is operational when entering the function to wait for user interrupt is not something that is useful or necessary. It is not done in any other wait_for_cs ioctl path. If the device becomes non-operational during the wait, the reset function will make sure the process wait is interrupted. Instead, move the check to the beginning of hl_wait_ioctl(). It will block any attempt to wait on CS or user interrupt once the device is already marked as non-operational. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: disable IRQ in user interrupts spinlock | Oded Gabbay
Because this spinlock is taken in an interrupt handler, we must use the spin_lock_irqsave/irqrestore version to disable the interrupts on the local CPU. Otherwise, we can have a potential deadlock (if the interrupt handler is scheduled to run on the same CPU that the code that took the lock was running on). Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: add "in device creation" status | Omer Shpigelman
On init, the disabled state is cleared right before hw_init, which causes the device to report the "Operational" state before device initialization is finished. Although the char device is not yet exposed to the user at this stage, the sysfs entries are exposed. This can cause errors in monitoring applications that use the sysfs entries. In order to avoid this, a new state "in device creation" is introduced, to be reported when the device is not disabled but is still in the init flow. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: invalidate PMMU mem cache on init | Oded Gabbay
This must be done to clear the internal mem cache so we won't get ECC errors on the first invalidation. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: size should be printed in decimal | Oded Gabbay
It's more readable for the size to be in decimal. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: define DC POWER for secured PMC | Oded Gabbay
In secured mode, the CGM is disabled. Therefore, the DC power is higher. Without taking it into consideration, the utilization is 12-15% at idle. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: unmask out of bounds SLM access interrupt | Tomer Tayar
The out of bounds SLM access TPC interrupt indicates a severe compiler bug and must be reported to the user. This interrupt is currently masked, so unmask it. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: add userptr_lookup node in debugfs | Yuri Nudelman
It is useful to have the ability to see which user address was pinned to which physical address during the initial mapping. We already have all that info stored, but no means to search this data (which may be quite large). Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: fetch TPC/MME ECC errors from F/W | Ofir Bitton
When F/W security is enabled, the driver cannot access the ECC registers, so it must fetch the ECC info from the F/W. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: modify multi-CS to wait on stream masters | Ohad Sharabi
During the integration, the multi-CS requirements were refined:
- The multi CS call shall wait on "per-ASIC" predefined stream masters instead of a set of streams.
- Stream masters are a set of QIDs used by the upper SW layers (synapse) for completion (must be an external/HW queue).
Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: add monitored SOBs to state dump | Alon Mizrahi
The current "state dump" lacks the monitored SOB IDs. Add them for convenience. Signed-off-by: Alon Mizrahi <amizrahi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: restore user registers when context opens | Oded Gabbay
Because we don't have multiple contexts in GAUDI, and to minimize calls to is_idle function (which uses many register reads), move the call to clear the user registers to the opening of the single user context. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: increase boot fit timeout | Oded Gabbay
Various f/w versions have different timeouts, so increase the default timeout to accommodate all the options. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: update to latest firmware headers | Oded Gabbay
Add several new packets between driver and firmware. Add matching compatibility bits for backward compatibility. Add support for 4K event types. Add information about pcie errors. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: minimize number of register reads | Oded Gabbay
Because the register reads might be trapped by the hypervisor in certain deployments, minimize the number of reads during runtime by moving static initializations to functions that occur during device initialization instead of context open. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: fix mmu node address resolution in debugfs | Yuri Nudelman
The address resolution via debugfs was not taking into consideration the page offset, resulting in a wrong address. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
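The role of the page offset in address resolution can be sketched as follows. This is a generic illustration of combining a translated page base with the in-page offset; `resolve_addr()` and its parameters are hypothetical and are not taken from the driver's debugfs code.

```c
#include <stdint.h>

/* Combine the physical base of the mapped page with the offset of the
 * virtual address inside that page. page_size must be a power of two. */
uint64_t resolve_addr(uint64_t page_phys_base, uint64_t virt_addr,
                      uint64_t page_size)
{
    uint64_t offset_mask = page_size - 1;

    /* Dropping the second term returns the start of the page instead of
     * the requested byte, which is the kind of error described above. */
    return (page_phys_base & ~offset_mask) | (virt_addr & offset_mask);
}
```

For a 4 KiB page at physical base 0x200000, a virtual address ending in 0x234 resolves to 0x200234 rather than 0x200000.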
2021-09-01 | habanalabs: save pid per userptr | Yuri Nudelman
Currently userptr endpoint in debugfs prints out virtual addresses in the user process memory space, without specifying their owner process ID. User space virtual address is meaningless without knowing the owner process. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: move scrubbing to late init | Oded Gabbay
HW init is mostly about configuring registers. Therefore, it is better to activate DMAs only in late init and afterwards. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs/gaudi: scrub HBM to a specific value | Ofir Bitton
In order to enhance debuggability, we will scrub the whole HBM to a specific value, in case HBM scrubbing is enabled. Scrubbing will be performed after reset and after user closes the FD. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: add validity check for event ID received from F/W | Ofir Bitton
Currently there is no validity check for the event ID received from the F/W, which exposes the driver to a memory overrun. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
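The kind of check described here amounts to bounding an externally supplied index before using it on a table. The sketch below assumes a small made-up event table; `EVENT_TABLE_SIZE`, `event_names`, and `event_name()` are hypothetical and do not reflect the driver's real event list.

```c
#include <stddef.h>
#include <stdint.h>

#define EVENT_TABLE_SIZE 4 /* hypothetical table size */

/* Hypothetical per-event descriptions, indexed by event ID. */
static const char *const event_names[EVENT_TABLE_SIZE] = {
    "event_a", "event_b", "event_c", "event_d",
};

/* Look up an event by an ID that comes from outside the driver.
 * Returns NULL for an out-of-range ID instead of reading past the table. */
const char *event_name(uint32_t event_id)
{
    if (event_id >= EVENT_TABLE_SIZE)
        return NULL;
    return event_names[event_id];
}
```

The caller treats a NULL result as an invalid event and drops it, so a corrupted or unexpected ID never becomes an out-of-bounds access.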
2021-09-01 | habanalabs: clear msg_to_cpu_reg to avoid misread after reset | Koby Elbaz
For some ASICs, the f/w reads the msg_to_cpu_reg value after reset, and for some it doesn't. Therefore, to be sure f/w doesn't read a wrong value after reset, we need to clear this register before the reset occurs. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: make set_pci_regions asic function | Ohad Sharabi
In order to better support variants of the same ASIC the set_pci_regions function is now an ASIC function which allows each ASIC to implement it internally, thus keeping all definitions static to the file. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-09-01 | habanalabs: convert PCI BAR offset to u64 | Ohad Sharabi
Done because the BAR size can exceed 4GB. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
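Why a 32-bit offset breaks once a BAR exceeds 4GB can be shown with plain integer arithmetic. The two helpers below are illustrative only and are not the driver's functions.

```c
#include <stdint.h>

/* Offset of addr inside a BAR, stored in 32 bits: silently truncated
 * whenever the offset is 4 GiB or more. */
uint32_t bar_offset_u32(uint64_t bar_base, uint64_t addr)
{
    return (uint32_t)(addr - bar_base);
}

/* The same offset stored in 64 bits: exact for any BAR size. */
uint64_t bar_offset_u64(uint64_t bar_base, uint64_t addr)
{
    return addr - bar_base;
}
```

For an address 4 GiB + 16 bytes into the BAR, the 32-bit version yields 0x10, which points at the start of the BAR instead of the intended location.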
2021-09-01 | habanalabs: expose server type in INFO IOCTL | Oded Gabbay
Add the server type property to the hl_info_hw_ip_info structure that is exposed to the user via the INFO IOCTL. This is needed by the userspace s/w stack to know the connections map of the internal links that connect the ASIC among themselves inside the server. The F/W will tell us, as part of the NIC information, the server type that the GAUDI is located in. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs: remove redundant warning message | Oded Gabbay
This warning is redundant as we will print a notice in case the device is still in use after the FD was closed. No need to print the same message per context. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs: add support for encapsulated signals submission | farah kassabri
This commit is the second part of the encapsulated signals feature. It contains the driver support for submission of cs with encapsulated signals and the wait for them. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs: add support for encapsulated signals reservation | farah kassabri
The signaling from within encapsulated OP capability is merged into the existing stream architecture, such that one can trigger multiple signaling from an encapsulated op, according to the time the event was done in the graph execution and avoid the need to wait for the whole encapsulated OP execution to be complete before the stream can signal. This commit implements only the reserve/unreserve part. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs: signal/wait change sync object reset flow | farah kassabri
Currently the SOB reset is done in the fence release function, which happens only at the CS wraparound during CS allocation. In order to support the new encapsulated signals reservation feature, we need to move the SOB reset to an earlier phase, because this SOB could reach its max value very fast using the signal reservation. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs: add wait-for-multi-CS uAPI | Ohad Sharabi
When the user sends multiple CSs, waiting for each CS separately is inefficient because it involves many user-kernel context switches. To address this, we add support for waiting on multiple CSs through a new uAPI that can wait on a maximum of 32 CSs. The new uAPI is selected by a new flag - WAIT_FOR_MULTI_CS - in the wait_for_cs IOCTL.
The input parameters for this uAPI are:
@seq: user pointer to an array of up to 32 CS sequence numbers.
@seq_array_len: length of the sequence array.
@timeout_us: timeout for waiting for any CS.
The output parameters for this uAPI are:
@status: multi CS ioctl completion status (a dedicated status was added as well).
@flags: bitmap of output flags of the CS.
@cs_completion_map: bitmap for multi CS; if the CS sequence placed at index N of the input seq array has completed, the N-th bit in cs_completion_map is 1, otherwise it is 0.
@timestamp_nsec: timestamp of the first completed CS.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
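The cs_completion_map encoding described above (bit N set when the CS at index N of the input array has completed) can be sketched in userspace. `build_completion_map()` and `MULTI_CS_MAX` are hypothetical names, and the boolean input array stands in for whatever per-CS completion state the driver actually tracks.

```c
#include <stdbool.h>
#include <stdint.h>

#define MULTI_CS_MAX 32 /* the commit states a maximum of 32 CSs */

/* Build the completion bitmap: bit i is 1 if completed[i] is true.
 * Returns 0 on success, -1 if len exceeds the 32-entry limit. */
int build_completion_map(const bool *completed, unsigned len, uint32_t *map)
{
    uint32_t m = 0;

    if (len > MULTI_CS_MAX)
        return -1;

    for (unsigned i = 0; i < len; i++) {
        if (completed[i])
            m |= UINT32_C(1) << i;
    }

    *map = m;
    return 0;
}
```

A caller that passed four sequence numbers and got back a map of 0x9 would know that the CSs at indices 0 and 3 have completed and those at indices 1 and 2 have not.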
2021-08-29 | habanalabs: get multiple fences under same cs_lock | Ohad Sharabi
To add proper support for wait-for-multi-CS, locking the CS lock for each CS fence in the list is not efficient. Instead, this patch adds support for taking the CS lock once to get all required fences. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs: revise prints on FD close | Oded Gabbay
The driver quietly handles memory mappings that were not freed so no need to print a warning about that when user closes the FD. Accordingly, revise the text that is printed in case the device is still in use after the user process closed the FD. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs/goya: add missing initialization | Oded Gabbay
Need to initialize f/w Linux loaded indication to false to prevent wrong communication with the f/w. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs: update firmware header to latest version | Oded Gabbay
Add two new fields regarding interrupts communication between driver and f/w. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs: fix race between soft reset and heartbeat | Koby Elbaz
There is a scenario where an ongoing soft reset would race with an ongoing heartbeat routine, eventually causing heartbeat to fail and thus to escalate into a hard reset. With this fix, soft-reset procedure will disable heartbeat CPU messages and flush the (ongoing) current one before continuing with reset code. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2021-08-29 | habanalabs/gaudi: fix information printed on SM event | Oded Gabbay
Print the SM name instead of the index, because it is more informative for the user to know the SM name rather than the id when an SM interrupt occurs. In addition, the index that is printed is of the SOB group, not a specific SOB. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>