Age | Commit message (Collapse) | Author |
|
Now that we have a subsystem for compute accelerators, move the
habanalabs driver to it.
This patch only moves the files and fixes the Makefiles. Future
patches will change the existing code to register to the accel
subsystem and expose the accel device char files instead of the
habanalabs device char files.
Update the MAINTAINERS file to reflect this change.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Add an option to control the timeout value for the driver's watchdog
of the reset process. The timeout represents the amount of the user
has to close his process once he gets a device reset notification from
the driver.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Several munmap() calls can be done or a mapped H/W block that has a
larger size than a page size.
Releasing the object should be done only when all mapped range is
unmapped.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
A common function that is called from multiple places can't be
located in degugfs.c because that file is only compiled if
debugfs is enabled in the kernel config file.
This can lead to undefined symbol compilation error.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Currently, to get engines status, user needed to read debugfs file
with root permissions.
This new uapi allows user apace apps retrieve status, so for example,
in case of failure, status can be retrieved immediately by the
application itself which runs without root permissions.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Change is_idle functions so it would be more usable outside debugfs.
Do this by replacing seq_file parameter with regular string.
Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In case security is enabled on the device, some debugfs nodes will
fail. Hence, we do not expose them.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Gaudi2 has new MMU units. A PMMU for device->host accesses, and HMMU
for HBM accesses.
The page tables of both MMUs are located in the host's memory (referred
to in the code as host-resident pgt).
Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In Gaudi2 we moved to a different wait for command submission
completion model. Instead of receiving interrupt only on external
queues, we use the device's sync manager to notify us when the
entire command submission finishes.
This enables us to remove the categorization of queues to external
and internal, and treat each queue equally, without the need to parse
and patch any command buffer.
This change also requires refactoring to the IRQ handling of
CS completions.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Add the ASIC-specific code for Gaudi2. Supply (almost) all of the
function callbacks that the driver's common code need to initialize,
finalize and submit workloads to the Gaudi2 ASIC.
It also contains the code to initialize the F/W of the Gaudi2 ASIC
and to receive events from the F/W.
It contains new debugfs entry to dump razwi events. razwi is a case
where the device's engines create a transaction that reaches an
invalid destination.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Region structure is derived from region type, hence no need to pass
it as an argument.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
move the field memory_scrub_val from struct hl_dbg_device_entry
to struct hl_device. This is because we want to use
this field also if debugfs is off.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Add the ability to scrub the device memory with a given value.
Add file 'dram_mem_scrub_val' to set the value
and a file 'dram_mem_scrub' to scrub the dram.
This is very important to help during automated tests, when you want
the CI system to randomize the memory before training certain
DL topologies.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
With the new code required for the flow added, we can now switch
to using the new memory manager infrastructure, removing the old code.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Even if an IOMMU might be present for some PCI segment in the system,
that doesn't necessarily mean it provides translation for the device
we care about. Replace iommu_present() with a more appropriate check.
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The habanalabs HW requires memory resources to be used by its
internal hardware structures. These structures are allocated and
initialized by the driver. We would like to use the device HBM for
that purpose. This memory is io-remapped and accessed using the
writel()/writeb()/writew() commands.
Since some of the HW structures are one byte in size we need to
add support for the writeb() and readb() functions in the driver.
Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Instead of using for_each_sg when iterating sgt that contains dma
entries, use the more proper for_each_sgtable_dma_sg macro.
In addition, both Goya and Gaudi have the exact same implementation
of the asic function that encapsulate the usage of this macro, so
it is better to move that implementation to the common code.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Currently each asic version implements 4 callbacks:
'debugfs_{read32/write32/read64/write64}'
There is a lot of code duplication among the different
callbacks of all asic versions.
This patch unify the code in order to avoid the code
duplication by iterating the pci_mem_region array
in hl_device and use its fields instead of macros.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When Gaudi device is secured the monitors data in the configuration
space is blocked from PCI access.
As we need to enable user to get sync-manager monitors registers when
debugging, this patch adds a debugfs that dumps the information to a
binary file (blob).
When a root user will trigger the dump, the driver will send request to
the f/w to fill a data structure containing dump of all monitors
registers.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
As the potential failure of the pci_enable_device(),
it should be better to check the return value and return
error if fails.
Fixes: 70b2f993ea4a ("habanalabs: create common folder")
Signed-off-by: Jiasheng Jiang <jiasheng@iscas.ac.cn>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
On Goya and Gaudi, the stop-on-error configuration can be set via
debugfs. However, in future devices, this configuration will always be
enabled.
Modify the debugfs node to be allowed only for ASICs that support this
dynamic configuration.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Due to the need of SynapseAI to configure all TPC engines from a single
QMAN, the driver must disable CGM and never allow the user to enable
it. Otherwise, the configuration of the TPC engines will fail.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Atomic operations during reset are replaced by a spinlock in order
to have the ability to protect more than a single variable.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Unify variables related to device reset, which will help us to
add some new reset functionality in future patches.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
There are multiple places where the code needs to get the context's
pointer and increment its ref cnt. This is the proper way instead
of using the compute context pointer in the device structure.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
I2C debugfs support is limited to 1 byte. We extend functionality
to more than 1 byte by using one of the pad fields as a length.
No backward compatibility issues as new F/W versions will treat 0
length as a 1 byte length transaction.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
VA range info could assist in debugging VA allocation bugs.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Command submission timeout is currently determined during driver
loading time. As some environments requires this timeout to be
modified in runtime, we introduce a new debugfs node that controls
the timeout value without the need to reload the driver.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
It is useful to have the ability to see which user address was pinned
to which physical address during the initial mapping. We already have
all that info stored, but no means to search this data (which may be
quite large).
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The address resolution via debugfs was not taking into consideration the
page offset, resulting in a wrong address.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Currently userptr endpoint in debugfs prints out virtual addresses
in the user process memory space, without specifying their owner process
ID. User space virtual address is meaningless without knowing the owner
process.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Recently, the size parameter in userptr structure was change to u64.
As a result, we need to change the type of the local range_size
in device_va_to_pa() to u64 to avoid overflow.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
To improve the user's ability to debug the case where a workload that
is part of executing training/inference of a topology is getting stuck,
we need to add a 'core dump' each time a CS times-out. The 'core dump'
shall contain all relevant Sync Manager information and corresponding
fence values.
The most recent dumps shall be accessible via debugfs, under
'state_dump' node. Reading from the node will provide the oldest dump
available. Writing an integer value X will discard X dumps, starting
with the oldest one, i.e. subsequent read will now return newer
dumps.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Increase the size variable in the userptr structure to 64-bit. That
variable describes the size of the memory allocation of the user that
is now being mapped into the device. The mapping can be larger than
4GB, so we need to support it.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
We don't use typedefs so the enum name shouldn't end with _t
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
To be able to debug long-running CS better, without changing the
userspace code, we are adding a new option through debugfs interface
to skip the reset of the device in case of CS timeout.
Signed-off-by: Yuri Nudelman <ynudelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
When trying to debug program, the user often needs to
dump large parts of the device's DRAM, which can reach to tens of GBs.
Because reading from the device's internal memory through the PCI BAR
is extremely slow, the debug can take hours.
Instead, we can provide the user to copy data through one of the DMA
engines. This will make the operation much faster.
Currently, only GAUDI is supported.
In GAUDI, we need to find a PCI DMA engine that is IDLE and set the
DMA as secured to be able to bypass our MMU as we currently don't
map the temporary buffer to the MMU.
Example bash one-line to dump entire HBM to file (~2 minutes):
for (( i=0x0; i < 0x800000000; i+=0x8000000 )); do \
printf '0x%x\n' $i | sudo tee /sys/kernel/debug/habanalabs/hl0/addr ; \
echo 0x8000000 | sudo tee /sys/kernel/debug/habanalabs/hl0/dma_size ; \
sudo cat /sys/kernel/debug/habanalabs/hl0/data_dma >> hbm.txt ; done
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In order to have a better debuggability we allow debugfs access
to user mmu mapped host memory. Non-user host memory access will be
rejected.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
if reset is due to heartbeat, device CPU is no responsive in which
case no point sending PCI disable message to it.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Improve "vm" debugfs node to print also the virtual addresses which are
currently mapped to HW blocks in the device.
Signed-off-by: Sagiv Ozeri <sozeri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The dentry for the created debugfs file was being saved, but never used
anywhere. As the pointer isn't needed for anything, and the debugfs
files are being properly removed by removing the parent directory,
remove the saved pointer as well, saving a tiny bit of memory and logic.
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Tomer Tayar <ttayar@habana.ai>
Cc: Moti Haimovski <mhaimovski@habana.ai>
Cc: Omer Shpigelman <oshpigelman@habana.ai>
Cc: Ofir Bitton <obitton@habana.ai>
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20210216150828.3855810-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Currently this API uses single 64 bits mask for engines idle indication.
Recently, it was observed that more bits are needed for some ASICs.
This patch modifies the use of the idle mask and the idle_extensions
mask.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In mmu debugfs node show un-scrambled physical addresses.
before read/write through data nodes, need to unscramble the
physical address before using it for pci transaction.
Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
In order to improve driver security debuggability, we add
security violations dump to debugfs.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
DRAM physical page sizes depend of the amount of HBMs available in
the device. this number is device-dependent and may also be subject
to binning when one or more of the DRAM controllers are found to
to be faulty. Such a configuration may lead to partitioning the DRAM
to non-power-of-2 pages.
To support this feature we also need to add infrastructure of address
scarmbling.
Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Modify the CS counter of a CB to be atomic, so no locking is required
when it is being modified or read.
Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
F/W message returns 64bit value but up until now we casted it to
a 32bit variable, instead of receiving 64bit in the first place.
Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
After the MMU-code refactoring, the existing MMU debugfs operations
are no longer working so we need to fix them.
In addition, remove the duplicate code that was in the debugfs code
and use the already existing MMU-code.
Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
Whether an ASIC has MMU towards its DRAM is an ASIC property, so
move it to the asic fixed properties structure.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
The new state indicates that device should be reset in order
to re-gain funcionality.
This unique state can occur if reset_on_lockup is disabled
and an actual lockup has occurred.
Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|