summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
AgeCommit message (Collapse)Author
2025-07-21Merge tag 'amd-drm-next-6.17-2025-07-17' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.17-2025-07-17: amdgpu: - Partition fixes - Reset fixes - RAS fixes - i2c fix - MPC updates - DSC cleanup - EDID fixes - Display idle D3 update - IPS updates - DMUB updates - Retimer fix - Replay fixes - Fix DC memory leak - Initial support for smartmux - DCN 4.0.1 degamma LUT fix - Per queue reset cleanups - Track ring state associated with a fence - SR-IOV fixes - SMU fixes - Per queue reset improvements for GC 9+ compute - Per queue reset improvements for GC 10+ gfx - Per queue reset improvements for SDMA 5+ - Per queue reset improvements for JPEG 2+ - Per queue reset improvements for VCN 2+ - GC 8 fix - ISP updates amdkfd: - Enable KFD on LoongArch radeon: - Drop console lock during suspend/resume UAPI: - Add userq slot info to INFO IOCTL Used for IGT userq validation tests (https://lists.freedesktop.org/archives/igt-dev/2025-July/093228.html) From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250717213827.2061581-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-07-17drm/amdgpu: move reset support type checks into the callerAlex Deucher
Rather than checking in the callbacks, check if the reset type is supported in the caller. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-16drm/amdgpu: track ring state associated with a fenceAlex Deucher
We need to know the wptr and sequence number associated with a fence so that we can re-emit the unprocessed state after a ring reset. Pre-allocate storage space for the ring buffer contents and add helpers to save off and re-emit the unprocessed state so that it can be re-emitted after the queue is reset. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15drm/sched: Rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESETMaíra Canal
Among the scheduler's statuses, the only one that indicates an error is DRM_GPU_SCHED_STAT_ENODEV. Any status other than DRM_GPU_SCHED_STAT_ENODEV signifies that the operation succeeded and the GPU is in a nominal state. However, to provide more information about the GPU's status, it is needed to convey more information than just "OK". Therefore, rename DRM_GPU_SCHED_STAT_NOMINAL to DRM_GPU_SCHED_STAT_RESET, which better communicates the meaning of this status. The status DRM_GPU_SCHED_STAT_RESET indicates that the GPU has hung, but it has been successfully reset and is now in a nominal state again. Reviewed-by: Philipp Stanner <phasta@kernel.org> Link: https://lore.kernel.org/r/20250714-sched-skip-reset-v6-1-5c5ba4f55039@igalia.com Signed-off-by: Maíra Canal <mcanal@igalia.com>
2025-07-10drm/amdgpu: Fix lifetime of struct amdgpu_task_info after ring resetAndré Almeida
When a ring reset happens, amdgpu calls drm_dev_wedged_event() using struct amdgpu_task_info *ti as one of the arguments. After using *ti, a call to amdgpu_vm_put_task_info(ti) is required to correctly track its lifetime. However, it's called from a place that the ring reset path never reaches due to a goto after drm_dev_wedged_event() is called. Move amdgpu_vm_put_task_info() bellow the exit label to make sure that it's called regardless of the code path. amdgpu_vm_put_task_info() can only accept a valid address or NULL as argument, so initialise *ti to make sure we can call this function if *ti isn't used. Fixes: a72002cb181f ("drm/amdgpu: Make use of drm_wedge_task_info") Reported-by: Dave Airlie <airlied@gmail.com> Closes: https://lore.kernel.org/dri-devel/CAPM=9tz0rQP8VZWKWyuF8kUMqRScxqoa6aVdwWw9=5yYxyYQ2Q@mail.gmail.com/ Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250704030629.1064397-1-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-07-04Merge tag 'amd-drm-next-6.17-2025-07-01' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.17-2025-07-01: amdgpu: - FAMS2 fixes - OLED fixes - Misc cleanups - AUX fixes - DMCUB updates - SR-IOV hibernation support - RAS updates - DP tunneling fixes - DML2 fixes - Backlight improvements - Suspend improvements - Use scaling for non-native modes on eDP - SDMA 4.4.x fixes - PCIe DPM fixes - SDMA 5.x fixes - Cleaner shader updates for GC 9.x - Remove fence slab - ISP genpd support - Parition handling rework - SDMA FW checks for userq support - Add missing firmware declaration - Fix leak in amdgpu_ctx_mgr_entity_fini() - Freesync fix - Ring reset refactoring - Legacy dpm verbosity changes amdkfd: - GWS fix - mtype fix for ext coherent system memory - MMU notifier fix - gfx7/8 fix radeon: - CS validation support for additional GL extensions - Bump driver version for new CS validation checks From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250701194707.32905-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-06-30drm/amdgpu: move scheduler wqueue handling into callbacksAlex Deucher
Move the scheduler wqueue stopping and starting into the ring reset callbacks. On some IPs we have to reset an engine which may have multiple queues. Move the wqueue handling into the backend so we can handle them as needed based on the type of reset available. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: move guilty handling into ring resetsAlex Deucher
Move guilty logic into the ring reset callbacks. This allows each ring reset callback to better handle fence errors and force completions in line with the reset behavior for each IP. It also allows us to remove the ring guilty callback since that logic now lives in the reset callback. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: move force completion into ring resetsAlex Deucher
Move the force completion handling into each ring reset function so that each engine can determine whether or not it needs to force completion on the jobs in the ring. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: rework queue reset scheduler interactionChristian König
Stopping the scheduler for queue reset is generally a good idea because it prevents any worker from touching the ring buffer. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30drm/amdgpu: update ring reset function signatureAlex Deucher
Going forward, we'll need more than just the vmid. Add the guilty amdgpu_fence. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18drm/amdgpu: switch job hw_fence to amdgpu_fenceAlex Deucher
Use the amdgpu fence container so we can store additional data in the fence. This also fixes the start_time handling for MCBP since we were casting the fence to an amdgpu_fence and it wasn't. Fixes: 3f4c175d62d8 ("drm/amdgpu: MCBP based on DRM scheduler (v9)") Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-17drm/amdgpu: Make use of drm_wedge_task_infoAndré Almeida
To notify userspace about which task (if any) made the device get in a wedge state, make use of drm_wedge_task_info parameter, filling it with the task PID and name. Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-7-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17drm: Create a task info option for wedge eventsAndré Almeida
When a device get wedged, it might be caused by a guilty application. For userspace, knowing which task was involved can be useful for some situations, like for implementing a policy, logs or for giving a chance for the compositor to let the user know what task was involved in the problem. This is an optional argument, when the task info is not available, the PID and TASK string won't appear in the event string. Sometimes just the PID isn't enough giving that the task might be already dead by the time userspace will try to check what was this PID's name, so to make the life easier also notify what's the task's name in the user event. Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com> Reviewed-by: Raag Jadav <raag.jadav@intel.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-4-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-06-17drm: amdgpu: Create amdgpu_vm_print_task_info()André Almeida
To avoid repetitive code in amdgpu, create a function that prints the content of struct amdgpu_task_info. Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://lore.kernel.org/r/20250617124949.2151549-3-andrealmeid@igalia.com Signed-off-by: André Almeida <andrealmeid@igalia.com>
2025-05-28drm/sched: Store the drm client_id in drm_sched_fencePierre-Eric Pelloux-Prayer
This will be used in a later commit to trace the drm client_id in some of the gpu_scheduler trace events. This requires changing all the users of drm_sched_job_init to add an extra parameter. The newly added drm_client_id field in the drm_sched_fence is a bit of a duplicate of the owner one. One suggestion I received was to merge those 2 fields - this can't be done right now as amdgpu uses some special values (AMDGPU_FENCE_OWNER_*) that can't really be translated into a client id. Christian is working on getting rid of those; when it's done we should be able to squash owner/drm_client_id together. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Signed-off-by: Philipp Stanner <phasta@kernel.org> Link: https://lore.kernel.org/r/20250526125505.2360-3-pierre-eric.pelloux-prayer@amd.com
2025-03-21drm/amdgpu: rework how isolation is enforced v2Christian König
Limiting the number of available VMIDs to enforce isolation causes some issues with gang submit and applying certain HW workarounds which require multiple VMIDs to work correctly. So instead start to track all submissions to the relevant engines in a per partition data structure and use the dma_fences of the submissions to enforce isolation similar to what a VMID limit does. v2: use ~0l for jobs without isolation to distinct it from kernel submissions which uses NULL for the owner. Add some warning when we are OOM. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-10drm/amdgpu: Trigger a wedged event for ring resetAndré Almeida
Instead of only triggering a wedged event for complete GPU resets, trigger for ring resets. Regardless of the reset, it's useful for userspace to know that it happened because the kernel will reject further submissions from that app. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-10Merge tag 'amd-drm-next-6.15-2025-03-07' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amdgpu: - Fix spelling typos - RAS updates - VCN 5.0.1 updates - SubVP fixes - DCN 4.0.1 fixes - MSO DPCD fixes - DIO encoder refactor - PCON fixes - Misc cleanups - DMCUB fixes - USB4 DP fixes - DM cleanups - Backlight cleanups and fixes - Support platform backlight curves - Misc code cleanups - SMU 14 fixes - JPEG 4.0.3 reset updates - SR-IOV fixes - SVM fixes - GC 12 DCC fixes - DC DCE 6.x fix - Hiberation fix amdkfd: - Fix possible NULL pointer in queue validation - Remove unnecessary CP domain validation - SDMA queue reset support - Add per process flags radeon: - Fix spelling typos - RS400 hyperZ fix UAPI: - Add KFD per process flags for setting precision Proposed user space: https://github.com/ROCm/ROCR-Runtime/commit/2a64fa5e06e80e0af36df4ce0c76ae52eeec0a9d Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250307211051.1880472-1-alexander.deucher@amd.com
2025-02-27drm/amdgpu: Create a debug option to disable ring resetAndré Almeida
Prior to the addition of ring reset, the debug option `debug_disable_soft_recovery` could be used to force a full device reset. Now that we have ring reset, create a debug option to disable them in amdgpu, forcing the driver to go with the full device reset path again when both options are combined. This option is useful for testing and debugging purposes when one wants to test the full reset from userspace. Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-25drm/amdgpu: Log after a successful ring resetAndré Almeida
When a ring reset happens, the kernel log shows only "amdgpu: Starting <ring name> ring reset", but when it finishes nothing appears in the log. Explicitly write in the log that the reset has finished correctly. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: André Almeida <andrealmeid@igalia.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-25drm/amdgpu: Update amdgpu_job_timedout to check if the ring is guiltyJesse.zhang@amd.com
This patch updates the `amdgpu_job_timedout` function to check if the ring is actually guilty of causing the timeout. If not, it skips error handling and fence completion. v2: move the is_guilty check down into the queue reset area (Alex) v3: need to call is_guilty before reset (Alex) v4: squash in is_guilty logic fixes (Alex) Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Jesse Zhang <jesse.zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-24drm/amdgpu: Pop jobs from the queue more robustlyTvrtko Ursulin
Replace a copy of DRM scheduler's to_drm_sched_job with a copy of a newly added drm_sched_entity_queue_pop. This allows breaking the hidden dependency that queue_node has to be the first element in struct drm_sched_job. A comment is also added with a reference to the mailing list discussion explaining the copied helper will be removed when the whole broken amdgpu_job_stop_all_jobs_on_sched is removed. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Christian König <christian.koenig@amd.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Philipp Stanner <phasta@kernel.org> Cc: Zhang, Hawking <Hawking.Zhang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Philipp Stanner <phasta@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20250221105038.79665-3-tvrtko.ursulin@igalia.com
2024-12-18drm/amdgpu: partially revert "reduce reset time"Christian König
This partially reverts commit 194eb174cbe4fe2b3376ac30acca2dc8c8beca00. This commit introduced a new state variable into adev without even remotely worrying about CPU barriers. Since we already have the amdgpu_in_reset() function exactly for this use case partially revert that. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepareChristian König
As soon as the prepare phase is completed the VM might be released, better set it to NULL. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18drm/amdgpu: remove useless init from amdgpu_job_allocPierre-Eric Pelloux-Prayer
This init is useless because base.sched will be cleared to 0 in drm_sched_job_init because of commit 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()"). Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18drm/amdgpu: drop the amdgpu_device argument from amdgpu_ib_freePierre-Eric Pelloux-Prayer
It's unused. Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18drm/amdgpu: don't access invalid schedPierre-Eric Pelloux-Prayer
Since 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") accessing job->base.sched can produce unexpected results as the initialisation of (*job)->base.sched done in amdgpu_job_alloc is overwritten by the memset. This commit fixes an issue when a CS would fail validation and would be rejected after job->num_ibs is incremented. In this case, amdgpu_ib_free(ring->adev, ...) will be called, which would crash the machine because the ring value is bogus. To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this because the device is actually not used in this function. The next commit will remove the ring argument completely. Fixes: 2320c9e6a768 ("drm/sched: memset() 'job' in drm_sched_job_init()") Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-11-08Merge tag 'amd-drm-next-6.13-2024-11-06' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.13-2024-11-06: amdgpu: - Misc cleanups - OLED fixes - DCN 4.x fixes - DCN 3.5 fixes - 8K fixes - IPS fixes - DSC fixes - S3 fix - KASAN fix - SMU13 fixes - fdinfo fixes - USB-C fixes - ACPI fix - Fix dummy page overlapping mappings - Fix workload profile handling - Add user control for zero RPM on SMU13 - Cleaner shader updates - Stop syncing PRT map operations - Debugfs permissions fixes - Debugfs bounds check fix - RAS cleanups - Enforce isolation updates amdkfd: - Add topology cap flag for per queue reset - Add an interface to query whether KFD queues are present - Use dynamic allocation for get_cu_occupancy From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241106163904.189108-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-11-04drm/amdgpu: add ring reset messagesAlex Deucher
Add messages to make it clear when a per ring reset happens. This is helpful for debugging and aligns with other reset methods. v2: add ring name in success/fail messages (Lijo) Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> (v1) Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-10-29Merge tag 'amd-drm-next-6.13-2024-10-25' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.13-2024-10-25: amdgpu: - SDMA queue reset support - SMU 13.0.6 updates - Add debugfs interface to help limit jpeg queue scheduling for testing - JPEG 4.0.3 updates - Initial runtime repartitioning support - GFX9 fixes - Misc code cleanups - Rework IP structures to better handle multiple instances of an IP - DML updates - DSC fixes - HDR fixes - Brightness control updates - Runtime pm cleanup - DMCUB fixes - DCN 3.5 updates - Struct drm_edid cleanup - Fetch EDID from _DDC if available - Ring noop optimizations - MES logging fixes - 3DLUT fixes - DCN 4.x fixes - SMU 13.x fixes - Fixes for set_soft_freq_range() - ACPI fixes - SMU 14.x updates - PSR-SU fixes - fdinfo cleanup - DCN documentation updates amdkfd: - Misc code cleanups - Increase event FIFO size - Copy wave state fixes for SDMA radeon: - Fix possible overflow in packet3 check - Late init connector fix - Always set GEM function pointer Documentation: - Update drm-memory documentation From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241025132336.2416913-1-alexander.deucher@amd.com Signed-off-by: Dave Airlie <airlied@redhat.com>
2024-10-09Merge tag 'drm-misc-next-2024-09-20' of ↵Dave Airlie
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next drm-misc-next for v6.12: UAPI Changes: - Add panthor/DEV_QUERY_TIMESTAMP_INFO query. Cross-subsystem Changes: - Updated dt bindings. - Add documentation explaining default errnos for fences. - Mark dma-buf heaps creation functions as __init. Core Changes: - Split DSC helpers from DP helpers. - Clang build fixes for drm/mm test. - Remove simple pipeline support for gem-vram, no longer any users left after converting bochs. - Add erno to drm_sched_start to distinguish between GPU and queue reset. - Add drm_framebuffer testcases. - Fix uninitialized spinlock acquisition with CONFIG_DRM_PANIC=n. - Use read_trylock instead of read_lock in dma_fence_begin_signalling to quiesce lockdep. Driver Changes: - Assorted small fixes and updates for tegra, host1x, imagination, nouveau, panfrost, panthor, panel/ili9341, mali, exynos, panel/samsung-s6e3fa7, ast, bridge/ti-sn65dsi86, panel/himax-hx83112a, bridge/tc358767, bridge/imx8mp-hdmi-tx, panel/khadas-ts050, panel/nt36523, panel/sony-acx565akm, kmb, accel/qaic, omap, v3d. - Add bridge/TI TDP158. - Assorted documentation updates. - Convert bochs from simple drm to gem shmem, and check modes against available memory. - Many VC4 fixes, most related to scaling and YUV support. - Convert some drivers to use SYSTEM_SLEEP_PM_OPS and RUNTIME_PM_OPS. - Rockchip 4k@60 support. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/445713a6-2427-4c53-8ec2-3a894ec62405@linux.intel.com
2024-10-08drm/amdgpu: Remove the while loop from amdgpu_job_prepare_jobTvrtko Ursulin
While loop makes it sound like amdgpu_vmid_grab() potentially needs to be called multiple times to produce a fence, while in reality all code paths either return an error, assign a valid job->vmid or assign a vmid which will be valid once the returned fence signals. Therefore we can remove the loop to make it clear the call does not need to be repeated. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-10-08drm/amdgpu: Drop impossible condition from amdgpu_job_prepare_jobTvrtko Ursulin
Fence has been initialised to NULL so no need to test it. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-10-01drm/amdgpu: update the handle ptr in dump_ip_stateSunil Khatri
Update the ptr handle to amdgpu_ip_block ptr in all the functions. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-25drm/amdgpu: skip coredump after job timeout in SRIOVZhenGuo Yin
VF FLR will be triggered by host driver before job timeout, hence the error status of GPU get cleared. Performing a coredump here is unnecessary. Signed-off-by: ZhenGuo Yin <zhenguo.yin@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-11Merge drm/drm-next into drm-misc-nextThomas Zimmermann
Backmerging to get fixes from v6.12-rc7. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
2024-09-06drm/sched: add optional errno to drm_sched_start()Christian König
The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to distinguish between recovery methods, whether a queue reset or a full GPU reset was used. To improve this, we first try a soft recovery for timeout jobs and use the error code -ENODATA. If soft recovery fails, we proceed with a queue reset, where the error code remains -ENODATA for the job. Finally, for a full GPU reset, we use error codes -ECANCELED or -ETIME. This patch adds an error code parameter to drm_sched_start, allowing us to differentiate between queue reset and GPU reset failures. This enables user mode and test applications to validate the expected correctness of the requested operation. After a successful queue reset, the only way to continue normal operation is to call drm_sched_job_done with the specific error code -ENODATA. v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status and distinguish between queue reset and GPU reset. v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or -ETIME for GPU reset, returned to amdgpu_cs_wait_ioctl. v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional parameter to set dma_fence_set_error, allowing us to handle the specific error codes appropriately and dispose of bad jobs with the selected error code depending on whether it was a queue reset or GPU reset. v4: Alex suggested using a new name, drm_sched_start_with_recovery_error, which more accurately describes the function's purpose. Additionally, it was recommended to add documentation details about the new method. v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex) v6 (chk): rebase on upstream changes, cleanup the commit message, drop the new function again and update all callers, apply the errno also to scheduler fences with hw fences v7 (chk): rebased Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240826122541.85663-1-christian.koenig@amd.com
2024-08-29drm/amdgpu: Move the dumping log out of for loopSunil Khatri
log message "Dumping IP State Completed" needs to be logged only once when state dumping is complete. Hence moving it out of the for loop. Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Acked-by: Trigger Huang <Trigger.Huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-29drm/amdgpu: Do core dump immediately when job tmoTrigger Huang
Do the coredump immediately after a job timeout to get a closer representation of GPU's error status. V2: This will skip printing vram_lost as the GPU reset is not happened yet (Alex) V3: Unconditionally call the core dump as we care about all the reset functions(soft-recovery and queue reset and full adapter reset, Alex) V4: Do the dump after adev->job_hang = true (Sunil) Signed-off-by: Trigger Huang <Trigger.Huang@amd.com> Acked-by: Sunil Khatri <sunil.khatri@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-27Merge tag 'amd-drm-next-6.12-2024-08-26' of ↵Daniel Vetter
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.12-2024-08-26: amdgpu: - SDMA devcoredump support - DCN 4.0.1 updates - DC SUBVP fixes - Refactor OPP in DC - Refactor MMHUBBUB in DC - DC DML 2.1 updates - DC FAMS2 updates - RAS updates - GFX12 updates - VCN 4.0.3 updates - JPEG 4.0.3 updates - Enable wave kill (soft recovery) for compute queues - Clean up CP error interrupt handling - Enable CP bad opcode interrupts - VCN 4.x fixes - VCN 5.x fixes - GPU reset fixes - Fix vbios embedded EDID size handling - SMU 14.x updates - Misc code cleanups and spelling fixes - VCN devcoredump support - ISP MFD i2c support - DC vblank fixes - GFX 12 fixes - PSR fixes - Convert vbios embedded EDID to drm_edid - DCN 3.5 updates - DMCUB updates - Cursor fixes - Overdrive support for SMU 14.x - GFX CP padding optimizations - DCC fixes - DSC fixes - Preliminary per queue reset infrastructure - Initial per queue reset support for GFX 9 - Initial per queue reset support for GFX 7, 8 - DCN 3.2 fixes - DP MST fixes - SR-IOV fixes - GFX 9.4.3/4 devcoredump support - Add process isolation framework - Enable process isolation support for GFX 9.4.3/4 - Take IOMMU remapping into account for P2P DMA checks amdkfd: - CRIU fixes - Improved input validation for user queues - HMM fix - Enable process isolation support for GFX 9.4.3/4 - Initial per queue reset support for GFX 9 - Allow users to target recommended SDMA engines radeon: - remove .load and drm_dev_alloc - Fix vbios embedded EDID size handling - Convert vbios embedded EDID to drm_edid - Use GEM references instead of TTM - r100 cp init cleanup - Fix potential overflows in evergreen CS offset tracking UAPI: - KFD support for targetting queues on recommended SDMA engines Proposed userspace: https://github.com/ROCm/ROCR-Runtime/commit/2f588a24065f41c208c3701945e20be746d8faf7 https://github.com/ROCm/ROCR-Runtime/commit/eb30a5bbc7719c6ffcf2d2dd2878bc53a47b3f30 drm/buddy: - Add start address support for trim function From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240826201528.55307-1-alexander.deucher@amd.com
2024-08-16drm/amdgpu: increase the reset counter for the queue resetPrike Liang
Update the reset counter for the amdgpu_cs_query_reset_state() Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Prike Liang <Prike.Liang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-16drm/amdgpu: add per ring reset support (v5)Alex Deucher
If a specific job is hung, try and reset just the ring associated with the job. v2: move to amdgpu_job.c v3: fix drm_sched_stop() handling when ring reset fails v4: drop unnecessary amdgpu_fence_driver_clear_job_fences() and drm_sched_increase_karma() v5: rework sched_stop handling Acked-by: Vitaly Prosyak <vitaly.prosyak@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-08-07drm/amdgpu: Forward soft recovery errors to userspaceJoshua Ashton
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset. 1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/ Signed-off-by: Joshua Ashton <joshua@froggi.es> Reviewed-by: Marek Olšák <marek.olsak@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 434967aadbbbe3ad9103cc29e9a327de20fdba01) Cc: stable@vger.kernel.org
2024-08-06drm/amdgpu: Forward soft recovery errors to userspaceJoshua Ashton
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset. 1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/ Signed-off-by: Joshua Ashton <joshua@froggi.es> Reviewed-by: Marek Olšák <marek.olsak@amd.com> Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-10drm/amdgpu/job: Replace DRM_INFO/ERROR loggingAlex Deucher
Use the dev_info/err variants so we get per device logging in multi-GPU cases. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14drm/amdgpu: add reset source in various casesEric Huang
To fullfill the reset event description. Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com> Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-04-09drm/amdgpu: Refine IB schedule error loggingLijo Lazar
Downgrade to debug information when IBs are skipped. Also, use dev_* to identify the device. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-04drm/amdgpu: change vm->task_info handlingShashank Sharma
This patch changes the handling and lifecycle of vm->task_info object. The major changes are: - vm->task_info is a dynamically allocated ptr now, and its uasge is reference counted. - introducing two new helper funcs for task_info lifecycle management - amdgpu_vm_get_task_info: reference counts up task_info before returning this info - amdgpu_vm_put_task_info: reference counts down task_info - last put to task_info() frees task_info from the vm. This patch also does logistical changes required for existing usage of vm->task_info. V2: Do not block all the prints when task_info not found (Felix) V3: Fixed review comments from Felix - Fix wrong indentation - No debug message for -ENOMEM - Add NULL check for task_info - Do not duplicate the debug messages (ti vs no ti) - Get first reference of task_info in vm_init(), put last in vm_fini() V4: Fixed review comments from Felix - fix double reference increment in create_task_info - change amdgpu_vm_get_task_info_pasid - additional changes in amdgpu_gem.c while porting Cc: Christian Koenig <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-24drm/sched: Reverse run-queue priority enumerationLuben Tuikov
Reverse run-queue priority enumeration such that the higest priority is now 0, and for each consecutive integer the prioirty diminishes. Run-queues correspond to priorities. To an external observer a scheduler created with a single run-queue, and another created with DRM_SCHED_PRIORITY_COUNT number of run-queues, should always schedule sched->sched_rq[0] with the same "priority", as that index run-queue exists in both schedulers, i.e. a scheduler with one run-queue or many. This patch makes it so. In other words, the "priority" of sched->sched_rq[n], n >= 0, is the same for any scheduler created with any allowable number of run-queues (priorities), 0 to DRM_SCHED_PRIORITY_COUNT. Cc: Rob Clark <robdclark@gmail.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Danilo Krummrich <dakr@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: linux-arm-msm@vger.kernel.org Cc: freedreno@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: Luben Tuikov <ltuikov89@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20231124052752.6915-6-ltuikov89@gmail.com