summaryrefslogtreecommitdiff
path: root/drivers/misc/habanalabs
AgeCommit message (Collapse)Author
2020-09-22habanalabs: PCIe Advanced Error Reporting supportOfir Bitton
driver will now get notified upon any PCI error occurred and will respond according to the severity of the error. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-22habanalabs: expose sync manager resources allocation in INFO IOCTLOfir Bitton
Although the driver defines the first user-available sync manager object and monitor in habanalabs.h, we would like to also expose this information via the INFO IOCTL so the runtime can get this information dynamically. This is because in future ASICs we won't need to define it statically. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-22habanalabs: add information about PCIe controllerOfir Bitton
Update firmware header with new API for getting pcie info such as tx/rx throughput and replay counter. These counters are needed by customers for monitor and maintenance of multiple devices. Add new opcodes to the INFO ioctl to retrieve these counters. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-22habanalabs: Replace dma-fence mechanism with completionsOfir Bitton
habanalabs driver uses dma-fence mechanism for synchronization. dma-fence mechanism was designed solely for GPUs, hence we purpose a simpler mechanism based on completions to replace current dma-fence objects. Signed-off-by: Ofir Bitton <obitton@habana.ai> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-09-22habanalabs: increase length of ASIC nameOded Gabbay
Future ASIC names are longer than 15 chars so increase the variable length to 32 chars. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2020-08-31habanalabs: fix report of RAZWI initiator coordinatesOfir Bitton
All initiator coordinates received upon an 'MMU page fault RAZWI event' should be the routers coordinates, the only exception is the DMA initiators for which the reported coordinates correspond to their actual location. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-31habanalabs: prevent user buff overflowMoti Haimovski
This commit fixes a potential debugfs issue that may occur when reading the clock gating mask into the user buffer since the user buffer size was not taken into consideration. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: correctly report inbound pci region cfg errorOfir Bitton
During inbound iATU configuration we can get errors while configuring PCI registers, there is a certain scenario in which these errors are not reflected and driver is loaded with wrong configuration. Signed-off-by: Ofir Bitton <obitton@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: check correct vmalloc return codeOfir Bitton
vmalloc can return different return code than NULL and a valid pointer. We must validate it in order to dereference a non valid pointer. Signed-off-by: Ofir Bitton <obitton@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: validate FW file sizeOfir Bitton
We must validate FW size in order not to corrupt memory in case a malicious FW file will be present in system. Signed-off-by: Ofir Bitton <obitton@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: fix incorrect check on failed workqueue createColin Ian King
The null check on a failed workqueue create is currently null checking hdev->cq_wq rather than the pointer hdev->cq_wq[i] and so the test will never be true on a failed workqueue create. Fix this by checking hdev->cq_wq[i]. Addresses-Coverity: ("Dereference before null check") Fixes: 5574cb2194b1 ("habanalabs: Assign each CQ with its own work queue") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: set max power according to card typeOded Gabbay
In Gaudi, the default max power setting is different between PCI and PMC cards. Therefore, the driver need to set the default after knowing what is the card type. The current code has a bug where it limits the maximum power of the PMC card to 200W after a reset occurs. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: proper handling of alloc size in coresightOfir Bitton
Allocation size can go up to 64bit but truncated to 32bit, we should make sure it is not truncated and validate no address overflow. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: set clock gating according to maskOfir Bitton
Once clock gating is set we enable clock gating according to mask, we should also disable clock gating according to relevant bits. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: verify user input in cs_ioctl_signal_waitOfir Bitton
User input must be validated before using it to access internal structures. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: Fix a loop in gaudi_extract_ecc_info()Dan Carpenter
The condition was reversed. It should have been less than instead of greater than. The result is that we never enter the loop. Fixes: fcc6a4e60678 ("habanalabs: Extract ECC information from FW") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: Fix memory corruption in debugfsDan Carpenter
This has to be a long instead of a u32 because we write a long value. On 64 bit systems, this will cause memory corruption. Fixes: c216477363a3 ("habanalabs: add debugfs support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: validate packet id during CB parseOfir Bitton
During command buffer parsing, driver extracts packet id from user buffer. Driver must validate this packet id, since it is being used in order to extract information from internal structures. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: Validate user address before mappingOfir Bitton
User address must be validated before driver performs address map. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-08-22habanalabs: unmap PCI bars upon iATU failureOfir Bitton
In case the driver fails to configure the PCI controller iATU, it needs to unmap the PCI bars before exiting so if the driver is removed, the bars won't be left mapped. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-29habanalabs: remove unused but set variable 'ctx_asid'Wei Yongjun
Gcc report warning as follows: drivers/misc/habanalabs/common/command_submission.c:373:6: warning: variable 'ctx_asid' set but not used [-Wunused-but-set-variable] 373 | int ctx_asid, rc; | ^~~~~~~~ This variable is not used in function cs_timedout(), this commit remove it to fix the warning. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Link: https://lore.kernel.org/r/20200729155902.33976-1-weiyongjun1@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29habanalabs: goya_ctx_init() can be statickernel test robot
Signed-off-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/20200729000313.GA14680@e442e3f624c4 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29habanalabs: fix up absolute include instructionsGreg Kroah-Hartman
There's no need to try to be cute with the include file locations in the Makefile, so just specify exactly where the files are. Bonus is this fixes the problem of building with O= as well as trying to just build the subdirectory alone. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: Omer Shpigelman <oshpigelman@habana.ai> Cc: Tomer Tayar <ttayar@habana.ai> Cc: Moti Haimovski <mhaimovski@habana.ai> Cc: Ofir Bitton <obitton@habana.ai> Cc: Ben Segal <bpsegal20@gmail.com> Cc: Christine Gharzuzi <cgharzuzi@habana.ai> Cc: Pawel Piskorski <ppiskorski@habana.ai> Link: https://lore.kernel.org/r/20200728171851.55842-1-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-27Merge 5.8-rc7 into char-misc-nextGreg Kroah-Hartman
This should resolve the merge/build issues reported when trying to create linux-next. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-24habanalabs: Fix memory leak in error flow of context initializationTomer Tayar
Add a missing free of the cs_pending array in the error flow of context initialization. Fixes: c16d45f42b64 ("habanalabs: Use pending CS amount per ASIC") Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: use no flags on MMU cache invalidationTomer Tayar
gaudi_mmu_invalidate_cache() doesn't use the flags parameter, and thus it can be set to 0 when the function is called in the gaudi only files. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: enable device before hw_init()Oded Gabbay
Device is now enabled before the hw_init() because part of the initialization requires communication with the device firmware to get information that is required for the initialization itself Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2020-07-24habanalabs: create internal CB poolOfir Bitton
Create a device MMU-mapped internal command buffer pool, in order to allow the driver to allocate CBs for the signal/wait operations that are fetched by the queues when they are configured with the user's address space ID. We must pre-map this internal pool due to performance issues. This pool is needed for future ASIC support and it is currently unused in GOYA and GAUDI. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: update hl_boot_if.h from firmwareOded Gabbay
Update the boot interface file from the latest version from firmware. Defines for secure boot were added. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
2020-07-24habanalabs: create common folderOded Gabbay
For internal needs of our CI we need to move all the common code into a common folder instead of putting them in the root folder of the driver. Same applies to the common header files under include/ Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai>
2020-07-24habanalabs: check for DMA errors when clearing memoryMoti Haimovski
In GAUDI we use QMAN0 DMA for clearing the MMU memory region at initialization. if this operation fails it places the DMA in an error state and then when trying to initialize QMAN0 we fail and erroneously assume its the QMAN that failed. This commit adds a check and clear of such DMA errors at initialization so we will have a better understanding of what went wrong. Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: verify queue can contain all cs jobsOfir Bitton
In order for the user to be aware of wrong inputs, we must return error in case the amount of jobs per cs exceeds the corresponding queue size. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: Assign each CQ with its own work queueOfir Bitton
We identified a possible race during job completion when working with a single multi-threaded work queue. In order to overcome this race we suggest using a single threaded work queue per completion queue, hence we guarantee jobs completion in order. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: halt device CPU only upon certain resetOded Gabbay
Currently the driver halts the device CPU in the halt engines function, which halts all the engines of the ASIC. The problem is that if later on we stop the reset process (due to inability to clean memory mappings in time), the CPU will remain in halt mode. This creates many issues, such as thermal/power control and FLR handling. Therefore, move the halting of the device CPU to the very end of the reset process, just before writing to the registers to initiate the reset. In addition, the driver now needs to send a message to the device F/W to disable it from sending interrupts to the host machine because during halt engines function the driver disables the MSI/MSI-X interrupts. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2020-07-24habanalabs: remove unused hashOmer Shpigelman
Remove an old hash that is not in use anymore. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: use queue pi/ci in order to determine queue occupancyOfir Bitton
Instead of using the free slots amount on the compute CQ to determine whether we can submit work to queues, use the queues pi/ci. This is needed in future ASICs where we don't have CQ per queue. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: configure maximum queues per asicOfir Bitton
Currently the amount of maximum queues is statically configured. Using a static value is causing redundunt cycles when traversing all queues and consumes more memory than actually needed. In this patch we configure each asic with the exact number of queues needed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: remove soft-reset support from GAUDIOded Gabbay
Soft-reset isn't supported in GAUDI. Remove the code that performs it and print error in case the user wants to do it via sysfs. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2020-07-24habanalabs: PCIe iATU refactoringOfir Bitton
Divide iATU initialization into inbound/outbound methods. We must separate it in order to enable different match mode per PCIe region. In addition, added support for PCI address match mode. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: Extract ECC information from FWOded Gabbay
ECC (Error Correcting Code) interrupts are going to be handled by the FW. Hence, we define an interface in which the driver can obtain the relevant ECC information. This information is needed for monitoring and can also lead to a hard reset if ECC error is not correctable. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: Add dropped cs statistics info structOfir Bitton
Add command submission statistics structure which can be obtained through the info ioctl. Each drop counter describes the reason for which the command submission was dropped. This information is needed for the user to be aware of the specific reason for which the submitted work was dropped. The user can then utilize the driver more efficiently. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: extract cpu boot status lookupChristine Gharzuzi
Extract detection of the cpu boot status to a function to allow code reuse Signed-off-by: Christine Gharzuzi <cgharzuzi@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: rephrase error messagesOded Gabbay
rephrase some error/warning/notice messages to make them more accessible to ordinary users. There is no need to print context ASID as the driver currently doesn't support multiple contexts. Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com> Reviewed-by: Tomer Tayar <ttayar@habana.ai>
2020-07-24habanalabs: Increase queues depthOfir Bitton
After recent concurrent cs amount increase, we must also increase queues depth since much more concurrent work can be done. All external queue depths were increased to 4096 as gaudi's internal queue depths were also increased to 1024. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: rephrase error messageOmer Shpigelman
Rephrase F/W error message to make it more understandable to ordinary users. Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: calculate trace frequency from PLLAdam Aharon
The profiler needs to know the PLL values for correctly showing the profiling data. Because our firmware can use different PLL configurations, we need to read the PLL values from the ASIC to pass them to the profiler. Signed-off-by: Adam Aharon <aaharon@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: align armcp_packet structure to 8 bytesOded Gabbay
Once there is a 64-bit field in a structure, GCC compiler for ARM aligns the structure to 8 bytes. In order to avoid confusion when these structures are being passed between CPUs from different architectures, we explicitly align the structure to 8 bytes. Reviewed-by: Omer Shpigelman <oshpigelman@habana.ai> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: Use mask instead of shift in sync stream registersOfir Bitton
Use proper bitfield masks instead of shifting values when configuring packets sent to device. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: sync stream generic functionalityOfir Bitton
Currently sync stream is limited only for external queues. We want to remove this constraint by adding a new queue property dedicated for sync stream. In addition we move the initialization and reset methods to the common code since we can re-use them with slight changes. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
2020-07-24habanalabs: Use pending CS amount per ASICOfir Bitton
Training schemes requires much more concurrent command submissions than inference does. In addition, training command submissions can be completed in a non serialized manner. Hence, we add support in which each ASIC will be able to configure the amount of concurrent pending command submissions, rather than use a predefined amount. This change will enhance performance by allowing the user to add more concurrent work without waiting for the previous work to be completed. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <oded.gabbay@gmail.com> Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>