summaryrefslogtreecommitdiff
path: root/drivers/misc/habanalabs
AgeCommit message (Collapse)Author
2022-11-23habanalabs/gaudi2: capture RAZWI informationDani Liberman
Added function to calculate possible engines which caused RAZWI (read-only zero, write ignored), from a given router id or module index. When getting RAZWI via PSOC IP, first the router id is calculated and then the possible engines that caused the RAZWI are calculated. There is a possibility that the RAZWI initiator is not an engine. In that case, it will not be included in possible engines as it doesn't have an engine id. RAZWI information is captured when receiving event from engine or via PSOC IP. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: handle HBM MMU when capturing page fault dataDani Liberman
In case of HBM MMU page fault, capture its relevant mappings. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: move reset workqueue to be under hl_deviceTomer Tayar
'struct hl_device_reset_work' is used as a wrapper for the reset work and its parameters, including the reset workqueue on which it runs. In a future commit, another reset related work with similar parameters is going to be added, but it won't use the reset workqueue. As in any case there is a single reset workqueue, and to allow the resue of this structure, move the reset workqueue to 'struct hl_device'. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: allow unregistering eventfd when device non-operationalTomer Tayar
Unregistering eventfd is for releasing host resources and doesn't involve an access to the device. As such, there is no reason to disallow it when device isn't operational. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: skip idle status check if reset on device releaseTomer Tayar
If reset upon device release is enabled, there is no need to check the device idle status in hpriv_release(), because device is going to be reset in any case. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs/gaudi2: add device unavailable notificationTal Cohen
Device unavailable notifies the user that there isn't an option to retrieve debug information from the device. When a critical device error occurs and the f/w performs the device reset, a device unavailable notification shall be sent to the user process. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs/gaudi2: remove privileged MME clock configurationKoby Elbaz
Privileged MME clock configuration is removed as it is done by the f/w. Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: replace 'pf' to 'prefetch'Dafna Hirschfeld
pf was an abbreviation for prefetch but because pf already stands for 'physical function', we decided to change it to 'prefetch'. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: add page fault info uapiDani Liberman
Only the first page fault will be saved. Besides the address which caused the page fault, the driver captures all of the mmu user mappings. User can retrieve this data via the new uapi (new opcode in INFO ioctl). Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs/gaudi2: fix module ID for RAZWI handlingTomer Tayar
RAZWI is optionally handled as part of the generic QM SEI error handling, but it always uses PDMA as the module ID. Fix it to use the suitable module ID according to the specific event. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: use lower_32_bits()Bharat Jauhari
This fixes sparse warning on doing cast to 32-bits Signed-off-by: Bharat Jauhari <bjauhari@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: refactor razwi event notificationDani Liberman
This event notification was compatible only with gaudi, where razwi and page fault happens together. To make it compatible with all ASICs, this refactor contains: 1. Razwi notification will only notify about razwi info. New notification will be added in future patch, to retrieve data about page fault error. 2. Changed razwi info structure to support all ASICs. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: Use simplified API for p2p dist calcOded Gabbay
Use the simplified API that calculates distance between two devices. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: allow control device open during resetOfir Bitton
Monitoring apps would like to query device state at any time so we should allow it also during reset because it doesn't involve accessing the h/w. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-11-23habanalabs: fix return value check in hl_fw_get_sec_attest_data()Yang Yingliang
If hl_cpu_accessible_dma_pool_alloc() fails, we should check 'req_cpu_addr', fix it. Fixes: 0c88760f8f5e ("habanalabs/gaudi2: add secured attestation info uapi") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-10-11treewide: use get_random_u32() when possibleJason A. Donenfeld
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-09-20habanalabs: eliminate aggregate use warningOded Gabbay
When doing sizeof() and giving as argument a dereference of a pointer-to-a-pointer object, clang will issue a warning. Eliminate the warning by passing struct <name>* Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-20habanalabs/gaudi: use 8KB aligned address for TPC kernelsTomer Tayar
I$ prefetch is enabled when sending a TPC kernel to initialize the TPC memory, and it has a restriction that the base address will be aligned to 8KB. Currently the base address is 128 bytes from the start address of the device SRAM, so prefetching will start 128 bytes before the actual kernel memory. Modify the kernel address to be 8KB aligned. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-20habanalabs: remove some f/w descriptor validationsfarah kassabri
To be forward-backward compatible with the firmware in the initial communication during preboot, we need to remove the validation of the header size. This will allow us to add more fields to the lkd_fw_comms_desc structure. Instead of the validation of the header size, we just print warning when some mismatch in descriptor has been revealed, and we calculate the CRC base on descriptor size reported by the firmware instead of calculating it ourselves. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-20habanalabs: build ASICs from new to oldOhad Sharabi
Newer ASICs code changes more often, has more chance to fail compilation. So, let's compile them first so errors in those files will fail compilation sooner. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: allow user to flush PCIE by readOfir Bitton
In order for the user to flush PCIE he needs to read some register from PCIE block. The chosen register is SPECIAL_GLBL_SPARE_0 and hence needs to be unsecured. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: failure to open device due to reset is debug levelOded Gabbay
If the user wants to open the device, and the device is currently in reset, the user will get an error from the open(). We don't need to display an error in the dmesg for that as it is not a real error and we can spam the kernel log with this message. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: Remove unnecessary (void*) conversionsLi zeming
The void pointer object can be directly assigned to different structure objects, it does not need to be cast. Signed-off-by: Li zeming <zeming@nfschina.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: add secured attestation info uapiDani Liberman
User will provide a nonce via the ioctl, and will retrieve secured attestation data of the boot, generated using given nonce. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: add handling to pmmu events in eqe handlerDani Liberman
In order to get the error cause and the captured address in case of page fault, added pmmu events to eqe handler. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi: change TPC Assert to use TPC DEC instead of QMAN errTal Cohen
This change is done while there is a problem to use QMAN error for TPC assert async. The problem involves security limitation that exists to generate the assert via QMAN error. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: rename error info structureDani Liberman
As a preparation for adding more errors to it, change to more suitable name. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: get f/w reset status register dynamicallyfarah kassabri
Get the firmware reset status address from the dynamic registers we read from the firmware instead of using a define. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: increase hard-reset sleep time to 2 secTomer Tayar
The access to the device registers is blocked during hard reset, until preboot runs and allows the access to specific registers, including the PSOC BTM_FSM register which is used to know when the reset is done. Between the reset request and until this register is polled there is a small delay of 500 msec which is not enough for F/W to process the reset and for preboot to run, so the register might be accessed while it is blocked. To avoid it, increase the delay to 2 sec. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: print RAZWI info upon PCIe access errorTomer Tayar
Add the dump of the RAZWI information when a PCIe access is blocked by RR. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: MMU invalidation h/w is per deviceOded Gabbay
The code used the mmu mutex to protect access to the context's page tables and invalidation of the MMU cache. Because pgt are per context, the mmu mutex was a member of the context object. The problem is that the device has a single MMU invalidation h/w (per MMU). Therefore, the mmu mutex should not be a property of the context but a property of the device. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: new notifier events for device stateTal Cohen
Add new notifier events that inform several device states. General H/W error raised on device general H/W error occurs. User engine error is raised when a device engine informs of an error. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: free event irq if init failsOded Gabbay
In case initialization fails after event irq was requested, we need to release that irq. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: fix resetting the DRAM BAROhad Sharabi
Current code does not takes into account the new DRAM region base and so calculated address is wrong and can lead to crush. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: add support for new cpucp return codesOfir Bitton
Firmware now responds with a more detailed cpucp return codes. Driver can now distinguish between error and debug return codes. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: read F/W security indication after hard resetTomer Tayar
F/W security status might change after every reset. Add the reading of the preboot status to the hard reset sequence, which among others reads this security indication. As this preboot status reading includes the waiting for the preboot to be ready, it can be removed from the CPU init which is done in a later stage. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi: rename mme cfg error response printOfir Bitton
Current description is misleading hence we rename it to a more suitable error description. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: fix possible hole in device vafarah kassabri
cb_map_mem() uses gen_pool_alloc() to get virtual address for mapping a CB. The mapping is done in chunks of page size, so if the CB size is larger, it is possible that the allocated virtual addresses won't be consecutive. User retrieves this device VA which returns the virtual address in the first va_block. If there is a "hole" in the virtual addresses, user can configure a HW block with a bad device VA. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: send device activity in a proper contextOfir Bitton
'Device activity open packet' should be sent outside of mutex as there is no real necessity for a lock. In addition 'device activity close packet' should be sent upon an actual release of the device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs: send device active message to f/wfarah kassabri
As part of the RAS that is done by the f/w, we should send a message to the f/w when a user either acquires or releases the device. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-19habanalabs/gaudi2: dump detailed information upon RAZWIOfir Bitton
In order to improve debuggability, we add all available information when a RAZWI event occur. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs/gaudi2: log critical events with no rate limitfarah kassabri
When we have a storm of errors of HBM ECC SERR we can reach a situation where driver start hard reset flow without logging the error cause that caused the hard reset due to logs rate limiting. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: ignore EEPROM errors during bootOfir Bitton
EEPROM errors reported by firmware are basically warnings and should not fail the boot process. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: perform context switch flow only if neededOfir Bitton
Except Goya, none of our ASICs require context switch flow, hence we enable this flow only where it is needed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: set command buffer host VA dynamicallyDafna Hirschfeld
Set the addresses for userspace command buffer dynamically instead of hard-coded. There is no reason for it to be hard-coded. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: trace DMA allocationsOhad Sharabi
This patch add tracepoints in the code for DMA allocation. The main purpose is to be able to cross data with the map operations and determine whether memory violation occurred, for example free DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. To get better understanding of what happened in the DMA allocations the real allocating function is added to the trace as well. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: trace MMU map/unmap pageOhad Sharabi
This patch utilize the defined tracepoint to trace the MMU's pages map/unmap operations. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: define trace eventsOhad Sharabi
This patch adds trace events for habanalabs driver to gain all the benefits such an infrastructure can supply. The following events were added: - MMU map/unmap: to be able to track driver's memory allocations - DMA alloc/free: to track our DMA allocation the above trace points in conjunction will help us map the device memory usage as well as to be able to track memory violations. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Acked-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs/gaudi2: assigning PQFs for ARC f/w in PDMARajarama Manjukody Bhat
Assigning 3 PQFs in PDMA1 and 2 PQFs in PDMA0 for ARC firmware usage. Signed-off-by: Rajarama Manjukody Bhat <rmbhat@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-09-18habanalabs: fix calculation of DRAM base address in PCIe BARTomer Tayar
The calculation of the device DRAM base address before setting the relevant PCIe BAR to point at it, has an assumption that this BAR is used to access only the DRAM, and thus the covered DRAM size is a power of 2. In future ASICs it is not necessarily true, so need to update the calculation to support also a non-power-of-2 size. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>