summaryrefslogtreecommitdiff
path: root/drivers/misc/habanalabs/goya/goya.c
AgeCommit message (Collapse)Author
2019-03-24habanalabs: remove low credit limit of DMA #0Oded Gabbay
Because DMA #0 is now used by the user, remove the limitation of credits from this channel. Without this patch, this channel is pretty much unusable due to its very low bandwidth configuration. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2019-03-03habanalabs: cast to expected typeOded Gabbay
This patch fix the following sparse warning: drivers/misc/habanalabs/goya/goya.c:3646:14: warning: incorrect type in assignment (different address spaces) drivers/misc/habanalabs/goya/goya.c:3646:14: expected void *base drivers/misc/habanalabs/goya/goya.c:3646:14: got void [noderef] <asn:2> * Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2019-03-03habanalabs: prevent host crash during suspend/resumeOded Gabbay
This patch fixes the implementation of suspend/resume of the device so that upon resume of the device, the host won't crash due to PCI completion timeout. Upon suspend, the device is being reset due to PERST. Therefore, upon resume, the driver must initialize the PCI controller as if the driver was loaded. If the controller is not initialized and the device tries to access the device through the PCI bars, the host will crash with PCI completion timeout error. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2019-03-02habanalabs: use %px instead of %p in error printOded Gabbay
When parsing the address of an internal command buffer, the driver prints an error if the buffer's address is not in the range of the device's DRAM or SRAM memory address space. Use %px to print the real address that the user gave the driver and not a hashed value, so the user will get a clue regarding the origin of his error. Note that if the print occurs, the pointer that is printed is a user's virtual address and not some kind of physical address. Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: fix little-endian<->cpu conversion warningsTomer Tayar
Add __cpu_to_le16/32/64 and __le16/32/64_to_cpu where needed according to sparse. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: soft-reset device if context-switch failsOded Gabbay
This patch fix a bug in the driver, where if the TPC or MME remains in non-IDLE even after all the command submissions are done (due to user bug or malicious user), then future command submissions will fail in the context-switch stage and the driver will remain in "stuck" mode. The fix is to do a soft-reset of the device in case the context-switch fails, because the device should be IDLE during context-switch. If it is not IDLE, then something is wrong and we should reset the compute engines. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: print pointer using %pOded Gabbay
Don't cast pointer to u64 to print it. Instead, print the pointer using %p. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: extend QMAN0 job timeoutOmer Shpigelman
This patch fix a bug where the timeout for sending a job on QMAN0 by KMD wasn't enough in palladium environment. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: set DMA0 completion to SOB 1007Oded Gabbay
This patch fix a bug where DMA channel 0 completion address wasn't initialized by the driver. The patch sets the address to Sync Object no. 1007 Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: fix validation of WREG32 to DMA completionOded Gabbay
This patch fix a bug in the validation of WREG32 in DMA queues. The validation was too strict. It allowed the user to set the completion address only for DMA channel 1. The fix allows the user to set the completion address for all 5 DMA channels. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: fix mmu cache registers initOded Gabbay
This patch fix an incorrect initialization of the MMU cache registers. The shift operation was done in the wrong direction. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: disable CPU access on timeoutsOded Gabbay
This patch provides a workaround for a bug in the F/W where the response time for a request from KMD may take more then 100ms. This could cause the queue between KMD and the F/W to get out of sync. The WA is to: 1. Increase the timeout of ALL requests to 1s. 2. In case a request isn't answered in time, mark the state as "cpu_disabled" and prevent sending further requests from KMD to the F/W. This will eventually lead to a heartbeat failure and hard reset of the device. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: add MMU DRAM default page mappingOmer Shpigelman
This patch provides a workaround for a H/W bug in Goya, where access to RAZWI from TPC can cause PCI completion timeout. The WA is to use the device MMU to map any unmapped DRAM memory to a default page in the DRAM. That way, the TPC will never reach RAZWI upon accessing a bad address in the DRAM. When a DRAM page is mapped by the user, its default mapping is overwritten. Once that page is unmapped, the MMU driver will map that page to the default page. To help debugging, the driver will set the default page area to 0x99 on device initialization. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-28habanalabs: Dissociate RAZWI info from event typesTomer Tayar
This patch provides a workaround for a H/W bug in the RAZWI logger in Goya. The logger doesn't recognize the initiator correctly and as a result, accesses from one initiator are reported that were coming from a different initiator. The WA is to print the error information from the event entries we receive without looking at the RAZWI logger at all. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27habanalabs: make functions static or declare themOded Gabbay
This patch fixes the below sparse warnings by either making the functions static or by adding a declaration in the relevant header file. In addition, the patch removes goya_mmap completely as it doesn't add any additional benefit. Fixes the following sparse warnings: drivers/misc/habanalabs/habanalabs_drv.c:24:1: warning: symbol 'hl_devs_idr' was not declared. Should it be static? drivers/misc/habanalabs/habanalabs_drv.c:25:1: warning: symbol 'hl_devs_idr_lock' was not declared. Should it be static? drivers/misc/habanalabs/memory.c:1451:5: warning: symbol 'hl_vm_ctx_init_with_ranges' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:396:5: warning: symbol 'goya_send_pci_access_msg' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:417:5: warning: symbol 'goya_pci_bars_map' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:557:6: warning: symbol 'goya_reset_link_through_bridge' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:774:5: warning: symbol 'goya_early_fini' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:857:6: warning: symbol 'goya_late_fini' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:971:5: warning: symbol 'goya_sw_fini' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:1233:5: warning: symbol 'goya_init_cpu_queues' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:2914:5: warning: symbol 'goya_suspend' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:2939:5: warning: symbol 'goya_resume' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:2952:5: warning: symbol 'goya_mmap' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:2957:5: warning: symbol 'goya_cb_mmap' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:2973:6: warning: symbol 'goya_ring_doorbell' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3063:6: warning: symbol 'goya_flush_pq_write' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3068:6: warning: symbol 'goya_dma_alloc_coherent' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3074:6: warning: symbol 'goya_dma_free_coherent' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3080:6: warning: symbol 'goya_get_int_queue_base' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3138:5: warning: symbol 'goya_send_job_on_qman0' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3295:5: warning: symbol 'goya_test_queue' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3417:6: warning: symbol 'goya_dma_pool_zalloc' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3426:6: warning: symbol 'goya_dma_pool_free' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3432:6: warning: symbol 'goya_cpu_accessible_dma_pool_alloc' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3448:6: warning: symbol 'goya_cpu_accessible_dma_pool_free' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3458:5: warning: symbol 'goya_dma_map_sg' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3467:6: warning: symbol 'goya_dma_unmap_sg' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:3473:5: warning: symbol 'goya_get_dma_desc_list_size' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4210:5: warning: symbol 'goya_parse_cb_no_mmu' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4261:5: warning: symbol 'goya_parse_cb_no_ext_quque' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4294:5: warning: symbol 'goya_cs_parser' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4307:6: warning: symbol 'goya_add_end_of_cb_packets' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4334:5: warning: symbol 'goya_context_switch' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4426:6: warning: symbol 'goya_restore_phase_topology' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4460:5: warning: symbol 'goya_debugfs_read32' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4510:5: warning: symbol 'goya_debugfs_write32' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4738:6: warning: symbol 'goya_handle_eqe' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:4836:6: warning: symbol 'goya_get_events_stat' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:5075:5: warning: symbol 'goya_send_heartbeat' was not declared. Should it be static? drivers/misc/habanalabs/goya/goya.c:5253:5: warning: symbol 'goya_get_eeprom_data' was not declared. Should it be static? Reported-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-26habanalabs: use u64 when comparing variables' sum to u32_maxOded Gabbay
This patch fixes two smatch warnings about two if statements that are always true because of the types of the variables used - u32 when comparing the sum to u32_max. The patch changes the types to be u64 so the accumalted sum can be checked if it is larger than u32_max Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-23habanalabs: don't print result when rc indicates errorOded Gabbay
send_cpu_message() doesn't update the result parameter when an error occurs in its code. Therefore, callers of send_cpu_message() shouldn't use the result value when the return code indicates error. This patch fixes a static checker warning in goya_test_cpu_queue(), where that function did print the result even though the return code from send_cpu_message() indicated error. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add debugfs supportOded Gabbay
This patch adds debugfs support to the driver. It allows the user-space to display information that is contained in the internal structures of the driver, such as: - active command submissions - active user virtual memory mappings - number of allocated command buffers It also enables the user to perform reads and writes through Goya's PCI bars. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: implement INFO IOCTLOded Gabbay
This patch implements the INFO IOCTL. That IOCTL is used by the user to query information that is relevant/needed by the user in order to submit deep learning jobs to Goya. The information is divided into several categories, such as H/W IP, Events that happened, DDR usage and more. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add virtual memory and MMU modulesOmer Shpigelman
This patch adds the Virtual Memory and MMU modules. Goya has an internal MMU which provides process isolation on the internal DDR. The internal MMU also performs translations for transactions that go from Goya to the Host. The driver is responsible for allocating and freeing memory on the DDR upon user request. It also provides an interface to map and unmap DDR and Host memory to the device address space. The MMU in Goya supports 3-level and 4-level page tables. With 3-level, the size of each page is 2MB, while with 4-level the size of each page is 4KB. In the DDR, the physical pages are always 2MB. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add command submission moduleOded Gabbay
This patch adds the main flow for the user to submit work to the device. Each work is described by a command submission object (CS). The CS contains 3 arrays of command buffers: One for execution, and two for context-switch (store and restore). For each CB, the user specifies on which queue to put that CB. In case of an internal queue, the entry doesn't contain a pointer to the CB but the address in the on-chip memory that the CB resides at. The driver parses some of the CBs to enforce security restrictions. The user receives a sequence number that represents the CS object. The user can then query the driver regarding the status of the CS, using that sequence number. In case the CS doesn't finish before the timeout expires, the driver will perform a soft-reset of the device. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add device reset supportOded Gabbay
This patch adds support for doing various on-the-fly reset of Goya. The driver supports two types of resets: 1. soft-reset 2. hard-reset Soft-reset is done when the device detects a timeout of a command submission that was given to the device. The soft-reset process only resets the engines that are relevant for the submission of compute jobs, i.e. the DMA channels, the TPCs and the MME. The purpose is to bring the device as fast as possible to a working state. Hard-reset is done in several cases: 1. After soft-reset is done but the device is not responding 2. When fatal errors occur inside the device, e.g. ECC error 3. When the driver is removed Hard-reset performs a reset of the entire chip except for the PCI controller and the PLLs. It is a much longer process then soft-reset but it helps to recover the device without the need to reboot the Host. After hard-reset, the driver will restore the max power attribute and in case of manual power management, the frequencies that were set. This patch also adds two entries to the sysfs, which allows the root user to initiate a soft or hard reset. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add sysfs and hwmon supportOded Gabbay
This patch add the sysfs and hwmon entries that are exposed by the driver. Goya has several sensors, from various categories such as temperature, voltage, current, etc. The driver exposes those sensors in the standard hwmon mechanism. In addition, the driver exposes a couple of interfaces in sysfs, both for configuration and for providing status of the device or driver. The configuration attributes is for Power Management: - Automatic or manual - Frequency value when moving to high frequency mode - Maximum power the device is allowed to consume The rest of the attributes are read-only and provide the following information: - Versions of the various firmwares running on the device - Contents of the device's EEPROM - The device type (currently only Goya is supported) - PCI address of the device (to allow user-space to connect between /dev/hlX to PCI address) - Status of the device (operational, malfunction, in_reset) - How many processes are open on the device's file Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add event queue and interruptsOded Gabbay
This patch adds support for receiving events from Goya's control CPU and for receiving MSI-X interrupts from Goya's DMA engines and CPU. Goya's PCI controller supports up to 8 MSI-X interrupts, which only 6 of them are currently used. The first 5 interrupts are dedicated for Goya's DMA engine queues. The 6th interrupt is dedicated for Goya's control CPU. The DMA queue will signal its MSI-X entry upon each completion of a command buffer that was placed on its primary queue. The driver will then mark that CB as completed and free the related resources. It will also update the command submission object which that CB belongs to. There is a dedicated event queue (EQ) between the driver and Goya's control CPU. The EQ is located on the Host memory. The control CPU writes a new entry to the EQ for various reasons, such as ECC error, MMU page fault, Hot temperature. After writing the new entry to the EQ, the control CPU will trigger its dedicated MSI-X entry to signal the driver that there is a new entry in the EQ. The driver will then read the entry and act accordingly. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add h/w queues moduleOded Gabbay
This patch adds the H/W queues module and the code to initialize Goya's various compute and DMA engines and their queues. Goya has 5 DMA channels, 8 TPC engines and a single MME engine. For each channel/engine, there is a H/W queue logic which is used to pass commands from the user to the H/W. That logic is called QMAN. There are two types of QMANs: external and internal. The DMA QMANs are considered external while the TPC and MME QMANs are considered internal. For each external queue there is a completion queue, which is located on the Host memory. The differences between external and internal QMANs are: 1. The location of the queue's memory. External QMANs are located on the Host memory while internal QMANs are located on the on-chip memory. 2. The external QMAN write an entry to a completion queue and sends an MSI-X interrupt upon completion of a command buffer that was given to it. The internal QMAN doesn't do that. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add basic Goya h/w initializationOded Gabbay
This patch adds the basic part of Goya's H/W initialization. It adds code that initializes Goya's internal CPU, various registers that are related to internal routing, scrambling, workarounds for H/W bugs, etc. It also initializes Goya's security scheme that prevents the user from abusing Goya to steal data from the host, crash the host, change Goya's F/W, etc. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add command buffer moduleOded Gabbay
This patch adds the command buffer (CB) module, which allows the user to create and destroy CBs and to map them to the user's process address-space. A command buffer is a memory blocks that reside in DMA-able address-space and is physically contiguous so it can be accessed by the device without MMU translation. The command buffer memory is allocated using the coherent DMA API. When creating a new CB, the IOCTL returns a handle of it, and the user-space process needs to use that handle to mmap the buffer to get a VA in the user's address-space. Before destroying (freeing) a CB, the user must unmap the CB's VA using the CB handle. Each CB has a reference counter, which tracks its usage in command submissions and also its mmaps (only a single mmap is allowed). The driver maintains a pool of pre-allocated CBs in order to reduce latency during command submissions. In case the pool is empty, the driver will go to the slow-path of allocating a new CB, i.e. calling dma_alloc_coherent. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-18habanalabs: add basic Goya supportOded Gabbay
This patch adds a basic support for the Goya device. The code initializes the device's PCI controller and PCI bars. It also initializes various S/W structures and adds some basic helper functions. Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>