summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdkfd
AgeCommit message (Collapse)Author
2024-08-20drm/amdkfd: APIs to stop/start KFD schedulingAmber Lin
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling. When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from runlist. If users send ioctls to KFD to create queues, they'll be added but those queues won't be mapped to runlist (so not scheduled) until amdgpu_amdkfd_start_sched is called. v2: fix build (Alex) Signed-off-by: Amber Lin <Amber.Lin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-13drm/amdkfd: Add node_id to location_id genericallyLijo Lazar
If there are multiple nodes per kfd device, add nodeid to location_id to differentiate. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-13drm/amdkfd: Handle queue destroy buffer access racePhilip Yang
Add helper function kfd_queue_unreference_buffers to reduce queue buffer refcount, separate it from release queue buffers. Because it is circular locking to hold dqm_lock to take vm lock, kfd_ioctl_destroy_queue should take vm lock, unreference queue buffers first, but not release queue buffers, to handle error in case failed to hold vm lock. Then hold dqm_lock to remove queue from queue list and then release queue buffers. Restore process worker restore queue hold dqm_lock, will always find the queue with valid queue buffers. v2 (Felix): - renamed kfd_queue_unreference_buffer(s) to kfd_queue_unref_bo_va(s) - added two FIXME comments for follow up Signed-off-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-13drm/amdkfd: fix partition query when setting up recommended sdma enginesJonathan Kim
When users dynamically set the partition mode through sysfs writes, this can lead to a double lock situation where the KFD is trying to take the partition lock when updating the recommended SDMA engines. Have the KFD reference its saved socket device number count instead. Also ensure we have enough SDMA xGMI engines to report the recommended engines in the first place. Fixes: e06b71b2313a ("drm/amdkfd: allow users to target recommended SDMA engines") Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06drm/amdkfd: fix debug watchpoints for logical devicesJonathan Kim
The number of watchpoints should be set and constrained per logical partition device, not by the socket device. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06drm/amdkfd: support per-queue reset on gfx9Jonathan Kim
Support per-queue reset for GFX9. The recommendation is for the driver to target reset the HW queue via a SPI MMIO register write. Since this requires pipe and HW queue info and MEC FW is limited to doorbell reports of hung queues after an unmap failure, scan the HW queue slots defined by SET_RESOURCES first to identify the user queue candidates to reset. Only signal reset events to processes that have had a queue reset. If queue reset fails, fall back to GPU reset. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-06drm/amdkfd: Fix compile error if HMM support not enabledPhilip Yang
Fixes the below if kernel config not enable HMM support >> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c:107:26: error: implicit declaration of function 'svm_range_from_addr' >> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c:107:24: error: assignment to 'struct svm_range *' from 'int' makes pointer from integer without a cast [-Wint-conversion] >> drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c:111:28: error: invalid use of undefined type 'struct svm_range' Fixes: b049504e211e ("drm/amdkfd: Validate user queue svm memory residency") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407252127.zvnxaKRA-lkp@intel.com/ Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-27drm/amdkfd: Fix missing error code in kfd_queue_acquire_buffersSrinivasan Shanmugam
The fix involves setting 'err' to '-EINVAL' before each 'goto out_err_unreserve'. Fixes the below: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c:265 kfd_queue_acquire_buffers() warn: missing error code 'err' drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_queue.c 226 int kfd_queue_acquire_buffers(struct kfd_process_device *pdd, struct queue_properties *properties) 227 { 228 struct kfd_topology_device *topo_dev; 229 struct amdgpu_vm *vm; 230 u32 total_cwsr_size; 231 int err; 232 233 topo_dev = kfd_topology_device_by_id(pdd->dev->id); 234 if (!topo_dev) 235 return -EINVAL; 236 237 vm = drm_priv_to_vm(pdd->drm_priv); 238 err = amdgpu_bo_reserve(vm->root.bo, false); 239 if (err) 240 return err; 241 242 err = kfd_queue_buffer_get(vm, properties->write_ptr, &properties->wptr_bo, PAGE_SIZE); 243 if (err) 244 goto out_err_unreserve; 245 246 err = kfd_queue_buffer_get(vm, properties->read_ptr, &properties->rptr_bo, PAGE_SIZE); 247 if (err) 248 goto out_err_unreserve; 249 250 err = kfd_queue_buffer_get(vm, (void *)properties->queue_address, 251 &properties->ring_bo, properties->queue_size); 252 if (err) 253 goto out_err_unreserve; 254 255 /* only compute queue requires EOP buffer and CWSR area */ 256 if (properties->type != KFD_QUEUE_TYPE_COMPUTE) 257 goto out_unreserve; This is clearly a success path. 258 259 /* EOP buffer is not required for all ASICs */ 260 if (properties->eop_ring_buffer_address) { 261 if (properties->eop_ring_buffer_size != topo_dev->node_props.eop_buffer_size) { 262 pr_debug("queue eop bo size 0x%lx not equal to node eop buf size 0x%x\n", 263 properties->eop_buf_bo->tbo.base.size, 264 topo_dev->node_props.eop_buffer_size); --> 265 goto out_err_unreserve; This has err in the label name. err = -EINVAL? 266 } 267 err = kfd_queue_buffer_get(vm, (void *)properties->eop_ring_buffer_address, 268 &properties->eop_buf_bo, 269 properties->eop_ring_buffer_size); 270 if (err) 271 goto out_err_unreserve; 272 } 273 274 if (properties->ctl_stack_size != topo_dev->node_props.ctl_stack_size) { 275 pr_debug("queue ctl stack size 0x%x not equal to node ctl stack size 0x%x\n", 276 properties->ctl_stack_size, 277 topo_dev->node_props.ctl_stack_size); 278 goto out_err_unreserve; err? 279 } 280 281 if (properties->ctx_save_restore_area_size != topo_dev->node_props.cwsr_size) { 282 pr_debug("queue cwsr size 0x%x not equal to node cwsr size 0x%x\n", 283 properties->ctx_save_restore_area_size, 284 topo_dev->node_props.cwsr_size); 285 goto out_err_unreserve; err? Not sure. 286 } 287 288 total_cwsr_size = (topo_dev->node_props.cwsr_size + topo_dev->node_props.debug_memory_size) 289 * NUM_XCC(pdd->dev->xcc_mask); 290 total_cwsr_size = ALIGN(total_cwsr_size, PAGE_SIZE); 291 292 err = kfd_queue_buffer_get(vm, (void *)properties->ctx_save_restore_area_address, 293 &properties->cwsr_bo, total_cwsr_size); 294 if (!err) 295 goto out_unreserve; 296 297 amdgpu_bo_unreserve(vm->root.bo); 298 299 err = kfd_queue_buffer_svm_get(pdd, properties->ctx_save_restore_area_address, 300 total_cwsr_size); 301 if (err) 302 goto out_err_release; 303 304 return 0; 305 306 out_unreserve: 307 amdgpu_bo_unreserve(vm->root.bo); 308 return 0; 309 310 out_err_unreserve: 311 amdgpu_bo_unreserve(vm->root.bo); 312 out_err_release: 313 kfd_queue_release_buffers(pdd, properties); 314 return err; 315 } Fixes: 629568d25fea ("drm/amdkfd: Validate queue cwsr area and eop buffer size") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Philip Yang <Philip.Yang@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-25drm/amdkfd: allow users to target recommended SDMA enginesJonathan Kim
Certain GPUs have better copy performance over xGMI on specific SDMA engines depending on the source and destination GPU. Allow users to create SDMA queues on these recommended engines. Close to 2x overall performance has been observed with this optimization. Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24drm/amdkfd: Validate queue cwsr area and eop buffer sizePhilip Yang
When creating KFD user compute queue, check if queue eop buffer size, cwsr area size, ctl stack size equal to the size of KFD node properities. Check the entire cwsr area which may split into multiple svm ranges aligned to granularity boundary. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24drm/amdkfd: Store queue cwsr area size to node propertiesPhilip Yang
Use the queue eop buffer size, cwsr area size, ctl stack size calculation from Thunk, store the value to KFD node properties. Those will be used to validate queue eop buffer size, cwsr area size, ctl stack size when creating KFD user compute queue. Those will be exposed to user space via sysfs KFD node properties, to remove the duplicate calculation code from Thunk. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24drm/amdkfd: Validate user queue updatePhilip Yang
Ensure update queue new ring buffer is mapped on GPU with correct size. Decrease queue old ring_bo queue_refcount and increase new ring_bo queue_refcount. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24drm/amdkfd: Validate user queue svm memory residencyPhilip Yang
Queue CWSR area maybe registered to GPU as svm memory, create queue to ensure svm mapped to GPU with KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED flag. Add queue_refcount to struct svm_range, to track queue CWSR area usage. Because unmap mmu notifier callback return value is ignored, if application unmap the CWSR area while queue is active, pr_warn message in dmesg log. To be safe, evict user queue. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23drm/amdkfd: Ensure user queue buffers residencyPhilip Yang
Add atomic queue_refcount to struct bo_va, return -EBUSY to fail unmap BO from the GPU if the bo_va queue_refcount is not zero. Create queue to increase the bo_va queue_refcount, destroy queue to decrease the bo_va queue_refcount, to ensure the queue buffers mapped on the GPU when queue is active. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23drm/amdkfd: Validate user queue buffersPhilip Yang
Find user queue rptr, ring buf, eop buffer and cwsr area BOs, and check BOs are mapped on the GPU with correct size and take the BO reference. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23drm/amdkfd: Refactor queue wptr_bo GART mappingPhilip Yang
Add helper function kfd_queue_acquire_buffers to get queue wptr_bo reference from queue write_ptr if it is mapped to the KFD node with expected size. Add wptr_bo to structure queue_properties because structure queue is allocated after queue buffers are validated, then we can remove wptr_bo parameter from pqm_create_queue. Rename structure queue wptr_bo_gart to hold wptr_bo reference for GART mapping and umapping. Move MES wptr_bo_gart mapping to init_user_queue, the same location with queue ctx_bo GART mapping. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23drm/amdkfd: amdkfd_free_gtt_mem clear the correct pointerPhilip Yang
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer, otherwise amdgpu_bo_unref clear the local variable, the original pointer not set to NULL, this could cause use-after-free bug. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-23drm/amdkfd: kfd_bo_mapped_dev support partitionPhilip Yang
Change amdgpu_amdkfd_bo_mapped_to_dev to use drm_priv as parameter instead of adev, to support spatial partition. This is only used by CRIU checkpoint restore now. No functional change. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-08drm/amdgpu: set CP_HQD_PQ_DOORBELL_CONTROL.DOORBELL_MODE to 1Zhigang Luo
to avoid reading wrong WPTR from doorbell in sriov vf, set CP_HQD_PQ_DOORBELL_CONTROL.DOORBELL_MODE to 1 to read WPTR from MQD. Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-08drm/amdgpu: add ras POSION_CONSUMPTION event id supportYang Wang
add amdgpu ras POSION_CONSUMPTION event id support. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-08drm/amdkfd: Use mode1 reset for GFX v9.4.4Stanley.Yang
GFX v9.4.4 uses mode1 reset to handle poison consumption. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-02drm/amdkfd: add KFD support for SDMA IP v6.1.2Tim Huang
Enable KFD setting SDMA info for SDMA 6.1.2. Signed-off-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-02drm/amdkfd: add KFD support for GC IP v11.5.2Tim Huang
Enable KFD for GC 11.5.2. Signed-off-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-01drm/amdkfd: Use device based logging for errorsLijo Lazar
Convert some pr_* to some dev_* APIs to identify the device. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-27drm/amdgpu/kfd: Add unlock() on error path to add_queue_mes()Dan Carpenter
We recently added locking to add_queue_mes() but this error path was overlooked. Add an unlock to the error path. Fixes: 1802b042a343 ("drm/amdgpu/kfd: remove is_hws_hang and is_resetting") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14drm/amdkfd: add ASIC version check for the reset selection of RAS poisonTao Zhou
GFX v9.4.3 uses mode1 reset, other ASICs choose mode2. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Acked-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14drm/amdkfd: use mode1 reset for RAS poison consumptionTao Zhou
Per firmware's requirement, replace mode2 with mode1. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14drm/amdkfd: Extend gfx12 trap handler fix to gfx10/11Jay Cornwall
In commit fda812ebe3d9 ("drm/amdkfd: gfx12 context save/restore trap handler fixes") the following fix was introduced but incorrectly restricted to gfx12. The same issue and a corresponding fix apply to gfx10 and gfx11. Do not overwrite TRAPSTS.{SAVECTX,HOST_TRAP} when restoring this register. Both of these fields can assert while the wavefront is running the trap handler. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14drm/amdgpu: add lock in kfd_process_dequeue_from_deviceYunxiang Li
We need to take the reset domain lock before talking to MES. While in this case we can take the lock inside the mes helper. We can't do so for most other mes helpers since they are used during reset. So for consistency sake we add the lock here. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14drm/amdgpu/kfd: remove is_hws_hang and is_resettingYunxiang Li
is_hws_hang and is_resetting serves pretty much the same purpose and they all duplicates the work of the reset_domain lock, just check that directly instead. This also eliminate a few bugs listed below and get rid of dqm->ops.pre_reset. kfd_hws_hang did not need to avoid scheduling another reset. If the on-going reset decided to skip GPU reset we have a bad time, otherwise the extra reset will get cancelled anyway. remove_queue_mes forgot to check is_resetting flag compared to the pre-MES path unmap_queue_cpsch, so it did not block hw access during reset correctly. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05drm/amdkfd: remove logically dead codeJesse Zhang
idr_for_each_entry can ensure that mem is not empty during the loop. So don't need check mem again. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05drm/amdkfd: add reset cause in gpu pre-reset smi eventEric Huang
reset cause is requested by customer as additional info for gpu reset smi event. v2: integerate reset sources suggested by Lijo Lazar Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05drm/amdkfd: remove dead code in kfd_create_vcrat_image_gpuJesse Zhang
kfd_create_vcrat_image_gpu itself checks the avail_size at the start. So the value of avail_size is at least VCRAT_SIZE_FOR_GPU(16384), minus struct crat_header(40UL) and struct crat_subtype_compute(40UL) it cannot be less than 0. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05drm/amdkfd: fix the kdf debugger issueJesse Zhang
The expression caps | HSA_CAP_TRAP_DEBUG_PRECISE_MEMORY_OPERATIONS_SUPPORTED and caps | HSA_CAP_TRAP_DEBUG_PRECISE_ALU_OPERATIONS_SUPPORTED are always 1/true regardless of the values of its operand. Fixes: 9243240bed38 ("drm/amdkfd: enable single alu ops for gfx12") Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05drm/amdkfd: Comment out the unused variable use_static in pm_map_queues_v9Jesse Zhang
To fix the warning about unused value, remove the use_static and use the parameter is_static directly. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05drm/amdkfd: Handle deallocated VPGRs in gfx11+ trap handlerJay Cornwall
A wavefront may deallocate its VGPRs at the end of a program while waiting for memory transactions to complete. If it subsequently receives a context save exception it will be unable to save, since this requires VGPRs. In this case the trap handler should terminate the wavefront. Fixes intermittent VM faults under context switching load. V2: Use S_ENDPGM instead of S_ENDPGM_SAVED for performance counters Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05drm/amdkfd: remove dead code in the function svm_range_get_pte_flagsJesse Zhang
The varible uncached set false, the condition uncached cannot be true. So remove the dead code, mapping flags will set the flag AMDGPU_VM_MTYPE_UC in else. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-05drm/amdkfd: gfx12 context save/restore trap handler fixesJay Cornwall
Fix LDS size interpretation: 512 bytes (>= gfx12) vs 256 (< gfx12). Ensure STATE_PRIV.BARRIER_COMPLETE cannot change after reading or before writing. Other waves in the threadgroup may cause this field to assert if they complete the barrier. Do not overwrite EXCP_FLAG_PRIV.{SAVE_CONTEXT,HOST_TRAP} when restoring this register. Both of these fields can assert while the wavefront is running the trap handler. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-29drm/amdkfd: Replace deprecated gfx12 trap handler instructionsJay Cornwall
Newer assemblers reject S_WAITCNT. All instances of S_WAITCNT can be replaced by S_WAITCNT 0 (< gfx12) or S_WAIT_IDLE (>= gfx12) since there is no concurrency of different memory instruction classes. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-29drm/amdkfd: Sync trap handler binary with sourceJay Cornwall
Source and binary have become mismatched during branch activity. Signed-off-by: Jay Cornwall <jay.cornwall@amd.com> Reviewed-by: Lancelot Six <lancelot.six@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-29drm/amdkfd: simplify APU VRAM handlingAlex Deucher
With commit 89773b85599a ("drm/amdkfd: Let VRAM allocations go to GTT domain on small APUs") big and small APU "VRAM" handling in KFD was unified. Since AMD_IS_APU is set for both big and small APUs, we can simplify the checks in the code. v2: clean up a few more places (Lang) Acked-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Lang Yu <Lang.Yu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-29Revert "drm/amdkfd: fix gfx_target_version for certain 11.0.3 devices"Alex Deucher
This reverts commit 28ebbb4981cb1fad12e0b1227dbecc88810b1ee8. Revert this commit as apparently the LLVM code to take advantage of this never landed. Reviewed-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: Feifei Xu <feifei.xu@amd.com>
2024-05-29drm/amdgpu: drop MES 10.1 support v3Alex Deucher
It was an enablement vehicle for MES 11 and was never productized. Remove it. v2: drop additional checks in the GFX10 code. v3: drop mes_api_def.h Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17drm/amdkfd: Add GFX1201 to svm_range_get_pte_flags functionSreekant Somasekharan
GFX1201 was missed in the commit below. Adding it in. Fixes: 628e1ace2379 ("drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC") Signed-off-by: Sreekant Somasekharan <sreekant.somasekharan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-17drm/amdkfd: Ensure gpu_id is uniqueHarish Kasiviswanathan
gpu_id needs to be unique for user space to identify GPUs via KFD interface. In the current implementation there is a very small probability of having non unique gpu_ids. v2: Add check to confirm if gpu_id is unique. If not unique, find one Changed commit header to reflect the above v3: Use crc16 as suggested-by: Lijo Lazar <lijo.lazar@amd.com> Ensure that gpu_id != 0 Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13drm/amdkfd: Use dev_error intead of pr_errorHarish Kasiviswanathan
No functional change. This will help in moving gpu_id creation to next step while still being able to identify the correct GPU Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13drm/amdkfd: Fix CU Masking for GFX 9.4.3Mukul Joshi
We are incorrectly passing the first XCC's MQD when updating CU masks for other XCCs in the partition. Fix this by passing the MQD for the XCC currently being updated with CU mask to update_cu_mask function. Fixes: fc6efed2c728 ("drm/amdkfd: Update CU masking for GFX 9.4.3") Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13drm/amd/amdkfd: Fix a resource leak in svm_range_validate_and_map()Ramesh Errabolu
Analysis of code by Coverity, a static code analyser, has identified a resource leak in the symbol hmm_range. This leak occurs when one of the prior steps before it is released encounters an error. Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-13drm/amdkfd: Remove arbitrary timeout for hmm_range_faultPhilip Yang
On system with khugepaged enabled and user cases with THP buffer, the hmm_range_fault may takes > 15 seconds to return -EBUSY, the arbitrary timeout value is not accurate, cause memory allocation failure. Remove the arbitrary timeout value, return EAGAIN to application if hmm_range_fault return EBUSY, then userspace libdrm and Thunk will call ioctl again. Change EAGAIN to debug message as this is not error. Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-08drm/amdkfd: Reconcile the definition and use of oem_id in struct ↵Michael Chen
kfd_topology_device Currently oem_id is defined as uint8_t[6] and casted to uint64_t* in some use case. This would lead code scanner to complain about access beyond. Re-define it in union to enforce 8-byte size and alignment to avoid potential issue. Signed-off-by: Michael Chen <michael.chen@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>