summaryrefslogtreecommitdiff
path: root/drivers/misc/habanalabs
AgeCommit message (Collapse)Author
2020-11-30habanalabs: refactor mmu va_range db structureOfir Bitton
Use an array of va_ranges instead of keeping each va_range separately, we do this for better readability and in order to support access to a specific range in a much elegant manner. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: move HW dirty check to a proper locationOfir Bitton
Driver must verify if HW is dirty before trying to fetch preboot information. Hence, we move this validation to a prior stage of the boot sequence. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: restore vm_pgoff after mmapOded Gabbay
Due to using dma_mmap_coherent() to perform mmap of dma memory, we had to clear the vm_pgoff field before calling that function. However, that broke the userspace (profiler tool) as they relied on searching the /proc/self/maps for these values to correctly "disassemble" the topology recipe. To re-enable that functionality, the driver can simply restore the value of vm_pgoff before returning to userspace but after calling dma_mmap_coherent(). Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: add 'needs reset' state in driverOfir Bitton
The new state indicates that device should be reset in order to re-gain funcionality. This unique state can occur if reset_on_lockup is disabled and an actual lockup has occurred. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: fix hard reset print and commentOmer Shpigelman
One of the first steps of a hard reset flow is to close all open user contexts. This user process teradown might take some time due to long cleanup in our driver or some other reason even before our cleanup flow. Hence fix the relevant print and comment to be more accurate. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: remove pcie_en strap toggleIgor Grinberg
Since the very large grace period is over and this functionality prevents us to implement the new reset sequence and apply security settings, we need to remove the code toggling the PCIE_EN bit in the straps register. Remove it for good. Signed-off-by: Igor Grinberg <igrinberg@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: remove duplicate printOded Gabbay
We print twice the firmware status regarding security, once in common code and once in asic code. Remove the print in asic code and leave the common code print. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: Separate CS job completion from its deallocationTomer Tayar
Current CS jobs are no longer needed after their completion. However, jobs of future workload might be in use even after they are completed. To allow that, the patch adds a refcount to the job object, and decouples its completion handling from its deallocation. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: increase MAX CS to 16KOded Gabbay
We need to have the MAX CS be much larger than the size of the different queues. In GAUDI we have around 8 groups of queues, and each group has 1K queue size. To prevent head-of-the-line blocking, we need to make sure there is sufficient number of available CS allocations even if one or more of those queues are full. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: reset device upon fw read failurefarah kassabri
failure in reading pre-boot verion is not handled correctly, upon failure we need to reset the device in order to be able to reinstall the driver. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: Move repeatedly included headers to habanalabs.hTomer Tayar
Several header files are repeatedly included in many files. Move these files to habanalabs.h which is included by all. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: release signal if collective wait was droppedOfir Bitton
As in standard wait cs, we must release a signal fence once a collective wait cs was dropped and not submitted. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: Skip updating CI of internal queues if not in useTomer Tayar
There are no internal queues if H/W queues are being used. In this case we can skip the redundant traversal over the queues array, looking for internal queues. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: Small refactoring of cs_do_release()Tomer Tayar
Slightly refactor the cs_do_release() function, to reduce nesting level and to ease the handling of future CS types. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: Small refactoring of CS IOCTL handlingTomer Tayar
Refactor the CS IOCTL handling by gathering common code into sub-functions, in order to ease future additions of new CS types. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: fetch PLL info from FWOfir Bitton
Once FW security is enabled there is no access to PLL registers, need to read values from FW using a dedicated interface. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: refactor MMU to support dual residency MMUMoti Haimovski
This commit refactors the MMU code to support PCI MMU page tables residing on host and DCORE MMU residing on the device DRAM at the same time. This is needed for future devices as on GAUDI and GOYA we have a single MMU where its page tables always reside on DRAM. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: fix MMU print messageMoti Haimovski
This commit fixes an incorrect error message Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: scrub all memory upon closing FDfarah kassabri
In cases of multi-tenants, administrators may want to prevent data leakage between users running on the same device one after another. To do that the driver can scrub the internal memory (both SRAM and DRAM) after a user finish to use the memory. Because in GAUDI the driver allows only one application to use the device at a time, it can scrub the memory when user app close FD. In future devices where we have MMU on the DRAM, we can scrub the DRAM memory with a finer granularity (page granularity) when the user allocates the memory. This feature is not supported in Goya. To allow users that want to debug their applications, we add a kernel module parameter to load the driver with this feature disabled. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: add support for FW securityOfir Bitton
Skip relevant HW configurations once FW security is enabled because these configurations are being performed by FW. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: fetch security indication from FWOfir Bitton
Add support for fetching security indication from FW. This indication is needed in order to skip unnecessary initializations done by FW. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: fix cs counters structurefarah kassabri
Fix cs counters structure in uapi to be one flat structure instead of two instances of the same other structure. use atomic read/increment for context counters so we could use one structure for both aggregated and context counters. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: advanced FW loadingOfir Bitton
Today driver is able to load a whole FW binary into a specific location on ASIC. We add support for loading sections from the same FW binary into different loactions. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: initialize variable before useOded Gabbay
GCC 7.3.1 20180303 (Red Hat 7.3.1-5) complains that collective_engine_id might be used uninitialized. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: remove unreachable codeOfir Bitton
Remove unreachable code in gaudi collective flow. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: make sure cs type is valid in cs_ioctl_signal_waitOded Gabbay
Although we get a valid cs type from the callee, in case new values will be added in the future, it is best to check the expected values in that function. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: monitor device memory usageOded Gabbay
In GAUDI we don't have an MMU towards the HBM device memory. Therefore, the user access that memory directly through physical address (via the different engines) without the need to go through the driver to allocate/free memory on the HBM. For system monitoring purposes, the driver will keep track of the HBM usage. This can be done as long as the user accurately reports the allocations and releases of HBM memory, through the existing MEMORY IOCTL uapi. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: sync stream collective supportOfir Bitton
Implement sync stream collective for GAUDI. Need to allocate additional resources for that and add ctx_fini() to clean up those resources. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: Set DMA5 QMAN internalOfir Bitton
DMA5 QMAN is designated to be used for reduction process, hence it will be no longer configured as external queue. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: sync stream collective infrastructureOfir Bitton
Define new API for collective wait support and modify sync stream common flow. In addition add kernel CB allocation support for internal queues. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: use enum for CB allocation optionsTal Cohen
In the future there will be situations where queues can accept either kernel allocated CBs or user allocated CBs, depending on different states. Therefore, instead of using a boolean variable of kernel/user allocated CB, we need to use a bitmask to indicate that, which will allow to combine the two options. Add a flag to the uapi so the user will be able to indicate whether the CB was allocated by kernel or by user. Of course the driver validates that. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: add support for NIC QMANsOded Gabbay
Initialize the QMANs that are responsible to submit doorbells to the NIC engines. Add support for stopping and disabling them, and reset them as part of the hard-reset procedure of GAUDI. This will allow the user to submit work to the NICs. Add support for receiving events on QMAN errors from the firmware. However, the nic_ports_mask is still initialized to 0. That means this code won't initialize the QMANs just yet. That will be in a later patch. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: add NIC security configurationOded Gabbay
Configure the security properties of the NIC IP. This is to prevent the user process from doing something with the NIC that he shouldn't do. e.g. crash the server, steal data, etc. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: add NIC firmware-related definitionsOded Gabbay
Add new structures and messages that the driver use to interact with the firmware to receive information and events (errors) about GAUDI's NIC. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: add NIC QMAN H/W and registers definitionsOded Gabbay
Add auto-generated header files that describe the NIC QMANs registers used by the driver. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: remove duplicate checkOded Gabbay
We already check if queue index is smaller than max queues a few lines above this check so no need to check this again. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: sync stream refactor functionsOfir Bitton
Refactor sync stream implementation by reducing function length for better readability. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: add support for multiple SOBs per monitorOfir Bitton
Support advanced monitor functionality to monitor more than a single SOB. In addition expand all CB generation functions with buffer offset in order to put in them multiple packets that are generated by different functions. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: sync stream structures refactorOfir Bitton
Refactor sync stream implementation by adding more structures for better readability. In addition reducing allocated resources. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: don't init vm module if no MMUOded Gabbay
In case we are running without MMU enabled (debug mode), no need to initialize the VM module in the driver. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: minimize prints when everything is fineOded Gabbay
No need to print when the driver starts to initialize the H/W. Drivers should be silent when everything is OK. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: support multiple types of firmwaresOded Gabbay
The driver now loads the firmware in two stages. For debugging purposes we need to support situations where only the first stage firmware is loaded. Therefore, use a bitmask to determine which F/W is loaded Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: we need CPU queues for hwmonOded Gabbay
F/W can be loaded but device CPU queues disabled. In that case, HWMON should be disabled. This is only relevant when debugging Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs/gaudi: move mmu_prepare to context initOfir Bitton
Currently mmu_prepare is located at context switch. Since we support a single context, no reason to reconfigure the MMU registers every context switch. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: change aggregate cs counters to atomicOded Gabbay
In case we will have multiple contexts/processes, we can't just increment aggregated counters. We need to make them atomic as they can be incremented by multiple processes Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: put devices before driver removalOfir Bitton
Driver never puts its device and control_device objects, hence a memory leak is introduced every driver removal. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30habanalabs: free host huge va_range if not usedOfir Bitton
If huge range is not valid, driver uses the host range also for huge page allocations, but driver never frees its allocation. This introduces a memory leak every time a user closes its context. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-23habanalabs/gaudi: fix missing code in ECC handlingOded Gabbay
There is missing statement and missing "break;" in the ECC handling code in gaudi.c This will cause a wrong behavior upon certain ECC interrupts. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-04habanalabs/gaudi: mask WDT error in QMANOded Gabbay
This interrupt cause is not relevant because of how the user use the QMAN arbitration mechanism. We must mask it as the log explodes with it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-04habanalabs/gaudi: move coresight mmu configOfir Bitton
We must relocate the coresight mmu configuration to the coresight flow to make it work in case the first submission is to configure the profiler. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>