summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/amd/amdgpu
AgeCommit message (Collapse)Author
2025-02-12drm/amdgpu: add dynamic workload profile switching for gfx11Alex Deucher
Enable dynamic workload profile switching for gfx11. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: add dynamic workload profile switching for gfx10Alex Deucher
Enable dynamic workload profile switching for gfx10. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu/gfx: add ring helpers for setting workload profileAlex Deucher
Add helpers to switch the workload profile dynamically when commands are submitted. This allows us to switch to the FULLSCREEN3D or COMPUTE profile when work is submitted. Add a delayed work handler to delay switching out of the selected profile if additional work comes in. This works the same as the VIDEO profile for VCN. This lets dynamically enable workload profiles on the fly and then move back to the default when there is no work. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdkfd: Have kfd driver use same PASID values from graphic driverXiaogang Chen
Current kfd driver has its own PASID value for a kfd process and uses it to locate vm at interrupt handler or mapping between kfd process and vm. That design is not working when a physical gpu device has multiple spatial partitions, ex: adev in CPX mode. This patch has kfd driver use same pasid values that graphic driver generated which is per vm per pasid. These pasid values are passed to fw/hardware. We do not need change interrupt handler though more pasid values are used. Also, pasid values at log are replaced by user process pid; pasid values are not exposed to user. Users see their process pids that have meaning in user space. Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: Check RRMT status for JPEG v4.0.3Lijo Lazar
RRMT could get dynamically enabled/disabled by PSP firmware. Read the status from register for reading RRMT status. For VFs, this is not accessible, hence assume that it's always disabled for now. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: Check RRMT status for VCN v4.0.3Lijo Lazar
RRMT could get dynamically enabled/disabled by PSP firmware. Read the status from register for reading RRMT status. For VFs, this is not accessible, hence assume that it's always disabled for now. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: add support for PSP IP version 14.0.5Tim Huang
This initializes PSP IP version 14.0.5. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: add support for SMU IP version 14.0.5Tim Huang
This initializes SMU IP version 14.0.5. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: enable VCN/JPEG CGPG for GC IP version 11.5.3Tim Huang
Enable VCN/JPEG CGPG for ASIC with GFX version 11.5.3. Signed-off-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com> Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: add support for MMHUB IP version 3.3.2Tim Huang
This initializes MMHUB IP version 3.3.2. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: add support for NBIO IP version 7.11.2Tim Huang
This initializes NBIO IP version 7.11.2. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: add support for SDMA IP version 6.1.3Tim Huang
This initializes SDMA IP version 6.1.3. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: add support for GC IP version 11.5.3Tim Huang
This initializes GC IP version 11.5.3. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: add OEM i2c bus for polaris chipsAlex Deucher
It uses the VGADCC bus. DC doesn't use this bus, so it is safe to add it here. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: rework i2c init and finiAlex Deucher
No functional change. Rework the code to allow for adding some additional i2c buses in conjunction with DC in the future. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu/atombios: drop empty functionAlex Deucher
This was leftover from when amdgpu was forked from radeon. The function is empty so drop it. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amd/display/dm: handle OEM i2c buses in i2c functionsAlex Deucher
Allow the creation of an OEM i2c bus and use the proper DC helpers for that case. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: increase amdgpu max rings limitSathishkumar S
increase max rings to 132 to support all JPEG5_0_1 cores, else ring_init fails due to ring count exceeding maximum limit. Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com> Reviewed-by: Leo Liu <leo.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: bail out when failed to load fw in psp_init_cap_microcode()Jiang Liu
In function psp_init_cap_microcode(), it should bail out when failed to load firmware, otherwise it may cause invalid memory access. Fixes: 07dbfc6b102e ("drm/amd: Use `amdgpu_ucode_*` helpers for PSP") Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12drm/amdgpu: bump version for RV/PCO compute fixAlex Deucher
Bump the driver version for RV/PCO compute stability fix so mesa can use this check to enable compute queues on RV/PCO. Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.12.x
2025-02-12drm/amdgpu/gfx9: manually control gfxoff for CS on RVAlex Deucher
When mesa started using compute queues more often we started seeing additional hangs with compute queues. Disabling gfxoff seems to mitigate that. Manually control gfxoff and gfx pg with command submissions to avoid any issues related to gfxoff. KFD already does the same thing for these chips. v2: limit to compute v3: limit to APUs v4: limit to Raven/PCO v5: only update the compute ring_funcs v6: Disable GFX PG v7: adjust order Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Suggested-by: Błażej Szczygieł <mumei6102@gmail.com> Suggested-by: Sergey Kovalenko <seryoga.engineering@gmail.com> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3861 Link: https://lists.freedesktop.org/archives/amd-gfx/2025-January/119116.html Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.12.x
2025-02-12drm/sched: Use struct for drm_sched_init() paramsPhilipp Stanner
drm_sched_init() has a great many parameters and upcoming new functionality for the scheduler might add even more. Generally, the great number of parameters reduces readability and has already caused one missnaming, addressed in: commit 6f1cacf4eba7 ("drm/nouveau: Improve variable name in nouveau_sched_init()"). Introduce a new struct for the scheduler init parameters and port all users. Reviewed-by: Liviu Dudau <liviu.dudau@arm.com> Acked-by: Matthew Brost <matthew.brost@intel.com> # for Xe Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> # for Panfrost and Panthor Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com> # for Etnaviv Reviewed-by: Frank Binns <frank.binns@imgtec.com> # for Imagination Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> # for Sched Reviewed-by: Maíra Canal <mcanal@igalia.com> # for v3d Reviewed-by: Danilo Krummrich <dakr@kernel.org> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> # for amdxdna Signed-off-by: Philipp Stanner <phasta@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20250211111422.21235-2-phasta@kernel.org
2025-02-06Merge drm/drm-next into drm-misc-nextMaxime Ripard
Bring rc1 to start the new release dev. Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-02-03drm/amdgpu: add a BO metadata flag to disable write compression for VulkanMarek Olšák
Vulkan can't support DCC and Z/S compression on GFX12 without WRITE_COMPRESS_DISABLE in this commit or a completely different DCC interface. AMDGPU_TILING_GFX12_SCANOUT is added because it's already used by userspace. Cc: stable@vger.kernel.org # 6.12.x Signed-off-by: Marek Olšák <marek.olsak@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-28drm/amd/amdgpu: change the config of cgcg on gfx12Kenneth Feng
change the config of cgcg on gfx12 Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 6.12.x
2025-01-24drm/amd/amdgpu: Enable scratch data dump for mes 12Shaoyun Liu
MES internal will check CP_MES_MSCRATCH_LO/HI register to set scratch data location during ucode start, driver side need to start the MES one by one with different setting for each pipe Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-24drm/amd: Clarify kdoc for amdgpu.gttsizeMario Limonciello
Effectively amdgpu.gttsize gets set to ~1/2 of RAM, but that's controlled by what the TTM page limit is set to. Clarify the kdoc. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-24drm/amd/amdgpu: Prevent null pointer dereference in GPU bandwidth calculationSrinivasan Shanmugam
If the parent is NULL, adev->pdev is used to retrieve the PCIe speed and width, ensuring that the function can still determine these capabilities from the device itself. Fixes the below: drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:6193 amdgpu_device_gpu_bandwidth() error: we previously assumed 'parent' could be null (see line 6180) drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 6170 static void amdgpu_device_gpu_bandwidth(struct amdgpu_device *adev, 6171 enum pci_bus_speed *speed, 6172 enum pcie_link_width *width) 6173 { 6174 struct pci_dev *parent = adev->pdev; 6175 6176 if (!speed || !width) 6177 return; 6178 6179 parent = pci_upstream_bridge(parent); 6180 if (parent && parent->vendor == PCI_VENDOR_ID_ATI) { ^^^^^^ If parent is NULL 6181 /* use the upstream/downstream switches internal to dGPU */ 6182 *speed = pcie_get_speed_cap(parent); 6183 *width = pcie_get_width_cap(parent); 6184 while ((parent = pci_upstream_bridge(parent))) { 6185 if (parent->vendor == PCI_VENDOR_ID_ATI) { 6186 /* use the upstream/downstream switches internal to dGPU */ 6187 *speed = pcie_get_speed_cap(parent); 6188 *width = pcie_get_width_cap(parent); 6189 } 6190 } 6191 } else { 6192 /* use the device itself */ --> 6193 *speed = pcie_get_speed_cap(parent); ^^^^^^ Then we are toasted here. 6194 *width = pcie_get_width_cap(parent); 6195 } 6196 } Fixes: 757e8b951ce2 ("drm/amdgpu: cache gpu pcie link width") Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-24drm/amdgpu: fix ring timeout issue in gfx10 sr-iov environmentLin.Cao
commit 26c95e838e63 ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") set job->vm as NULL if there is no fence. It will cause emit switch buffer be skippen if job->vm set as NULL. Check job rather than vm could solve this problem. Fixes: 26c95e838e63 ("drm/amdgpu: set the VM pointer to NULL in amdgpu_job_prepare") Signed-off-by: Lin.Cao <lincao12@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-24drm/amdgpu: fix the PCIe lanes reporting in the INFO IOCTLAlex Deucher
Combine the platform and GPU caps like we do for PCIe Gen. This aligns properly with expectations and documentation for the interface. Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3820 Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-24drm/amdgpu: cache gpu pcie link widthAlex Deucher
Get the PCIe link with of the device itself (or it's integrated upstream bridge) and cache that. v2: fix typo Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3820 Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-24drm/amdgpu: Refine ip detection log messageLijo Lazar
'add ip block' causes a confusion if the blocks are disabled later with ip_block_mask. Instead change to 'detected' and also add device context. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-24drm/amdgpu: Add handler for SDMA context emptyLijo Lazar
Context empty interrupt is enabled for SDMA 4.4.2. Add a handler for context empty interrupt so that it is disposed of fast, and not propagated to KFD layer. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14drm/amdgpu/gfx12: Add Cleaner Shader Support for GFX12.0 GPUsSrinivasan Shanmugam
This commit enables the cleaner shader feature for GFX12.0 and GFX12.0.1 GPUs. The cleaner shader is important for clearing GPU resources such as Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs) between workloads. - This feature ensures that GPU resources are reset between workloads, preventing data leaks and ensuring accurate computation. By enabling the cleaner shader, this update enhances the security and reliability of GPU operations on GFX12.0 hardware. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14drm/amdgpu: Mark debug KFD module params as unsafeKent Russell
Mark options only meant to be used for debugging as unsafe so that the kernel is tainted when they are used. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14drm/amdgpu: fix fw attestation for MP0_14_0_{2/3}Gui Chengming
FW attestation was disabled on MP0_14_0_{2/3}. V2: Move check into is_fw_attestation_support func. (Frank) Remove DRM_WARN log info. (Alex) Fix format. (Christian) Signed-off-by: Gui Chengming <Jack.Gui@amd.com> Reviewed-by: Frank.Min <Frank.Min@amd.com> Reviewed-by: Christian König <Christian.Koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14drm/amdgpu: always sync the GFX pipe on ctx switchChristian König
That is needed to enforce isolation between contexts. Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14drm/amdgpu: mark a bunch of module parameters unsafeChristian König
We sometimes have people trying to use debugging options in production environments. Mark options only meant to be used for debugging as unsafe so that the kernel is tainted when they are used. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Acked-by: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14drm/amdgpu: Use DRM scheduler API in amdgpu_xcp_release_schedTvrtko Ursulin
Lets use the existing helper instead of peeking into the structure directly. Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com> Cc: Christian König <christian.koenig@amd.com> Cc: Danilo Krummrich <dakr@redhat.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Philipp Stanner <pstanner@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14drm/amdgpu: disable gfxoff with the compute workload on gfx12Kenneth Feng
Disable gfxoff with the compute workload on gfx12. This is a workaround for the opencl test failure. Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-14drm/amdgpu: Fix Circular Locking Dependency in AMDGPU GFX IsolationSrinivasan Shanmugam
This commit addresses a circular locking dependency issue within the GFX isolation mechanism. The problem was identified by a warning indicating a potential deadlock due to inconsistent lock acquisition order. - The `amdgpu_gfx_enforce_isolation_ring_begin_use` and `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`, leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is called while `enforce_isolation_mutex` is held, and `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is held, it can create a circular dependency. By ensuring consistent lock usage, this fix resolves the issue: [ 606.297333] ====================================================== [ 606.297343] WARNING: possible circular locking dependency detected [ 606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G OE [ 606.297365] ------------------------------------------------------ [ 606.297375] kworker/u96:3/3825 is trying to acquire lock: [ 606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610 [ 606.297413] but task is already holding lock: [ 606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.297725] which lock already depends on the new lock. [ 606.297738] the existing dependency chain (in reverse order) is: [ 606.297749] -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}: [ 606.297765] __mutex_lock+0x85/0x930 [ 606.297776] mutex_lock_nested+0x1b/0x30 [ 606.297786] amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.298007] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.298225] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.298412] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.298603] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.298866] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.298880] process_one_work+0x21e/0x680 [ 606.298890] worker_thread+0x190/0x350 [ 606.298899] kthread+0xe7/0x120 [ 606.298908] ret_from_fork+0x3c/0x60 [ 606.298919] ret_from_fork_asm+0x1a/0x30 [ 606.298929] -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}: [ 606.298947] __mutex_lock+0x85/0x930 [ 606.298956] mutex_lock_nested+0x1b/0x30 [ 606.298966] amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu] [ 606.299190] process_one_work+0x21e/0x680 [ 606.299199] worker_thread+0x190/0x350 [ 606.299208] kthread+0xe7/0x120 [ 606.299217] ret_from_fork+0x3c/0x60 [ 606.299227] ret_from_fork_asm+0x1a/0x30 [ 606.299236] -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}: [ 606.299257] __lock_acquire+0x16f9/0x2810 [ 606.299267] lock_acquire+0xd1/0x300 [ 606.299276] __flush_work+0x250/0x610 [ 606.299286] cancel_delayed_work_sync+0x71/0x80 [ 606.299296] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.299509] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.299723] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.299909] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.300101] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.300355] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.300369] process_one_work+0x21e/0x680 [ 606.300378] worker_thread+0x190/0x350 [ 606.300387] kthread+0xe7/0x120 [ 606.300396] ret_from_fork+0x3c/0x60 [ 606.300406] ret_from_fork_asm+0x1a/0x30 [ 606.300416] other info that might help us debug this: [ 606.300428] Chain exists of: (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex [ 606.300458] Possible unsafe locking scenario: [ 606.300468] CPU0 CPU1 [ 606.300476] ---- ---- [ 606.300484] lock(&adev->gfx.kfd_sch_mutex); [ 606.300494] lock(&adev->enforce_isolation_mutex); [ 606.300508] lock(&adev->gfx.kfd_sch_mutex); [ 606.300521] lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)); [ 606.300536] *** DEADLOCK *** [ 606.300546] 5 locks held by kworker/u96:3/3825: [ 606.300555] #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680 [ 606.300577] #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680 [ 606.300600] #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu] [ 606.300837] #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.301062] #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610 [ 606.301083] stack backtrace: [ 606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G OE 6.10.0-amd-mlkd-610-311224-lof #19 [ 606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024 [ 606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched] [ 606.301140] Call Trace: [ 606.301146] <TASK> [ 606.301154] dump_stack_lvl+0x9b/0xf0 [ 606.301166] dump_stack+0x10/0x20 [ 606.301175] print_circular_bug+0x26c/0x340 [ 606.301187] check_noncircular+0x157/0x170 [ 606.301197] ? register_lock_class+0x48/0x490 [ 606.301213] __lock_acquire+0x16f9/0x2810 [ 606.301230] lock_acquire+0xd1/0x300 [ 606.301239] ? __flush_work+0x232/0x610 [ 606.301250] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301261] ? mark_held_locks+0x54/0x90 [ 606.301274] ? __flush_work+0x232/0x610 [ 606.301284] __flush_work+0x250/0x610 [ 606.301293] ? __flush_work+0x232/0x610 [ 606.301305] ? __pfx_wq_barrier_func+0x10/0x10 [ 606.301318] ? mark_held_locks+0x54/0x90 [ 606.301331] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301345] cancel_delayed_work_sync+0x71/0x80 [ 606.301356] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.301661] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.302050] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.302069] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.302452] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.302862] ? drm_sched_entity_error+0x82/0x190 [gpu_sched] [ 606.302890] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.303366] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.303388] process_one_work+0x21e/0x680 [ 606.303409] worker_thread+0x190/0x350 [ 606.303424] ? __pfx_worker_thread+0x10/0x10 [ 606.303437] kthread+0xe7/0x120 [ 606.303449] ? __pfx_kthread+0x10/0x10 [ 606.303463] ret_from_fork+0x3c/0x60 [ 606.303476] ? __pfx_kthread+0x10/0x10 [ 606.303489] ret_from_fork_asm+0x1a/0x30 [ 606.303512] </TASK> v2: Refactor lock handling to resolve circular dependency (Alex) - Introduced a `sched_work` flag to defer the call to `amdgpu_gfx_kfd_sch_ctrl` until after releasing `enforce_isolation_mutex`. - This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside the critical section, preventing the circular dependency and deadlock. - The `sched_work` flag is set within the mutex-protected section if conditions are met, and the actual function call is made afterward. - This approach ensures consistent lock acquisition order. Fixes: afefd6f24502 ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization") Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-13Merge tag 'amd-drm-next-6.14-2025-01-10' of ↵Dave Airlie
https://gitlab.freedesktop.org/agd5f/linux into drm-next amd-drm-next-6.14-2025-01-10: amdgpu: - Fix max surface handling in DC - clang fixes - DCN 3.5 fixes - DCN 4.0.1 fixes - DC CRC fixes - DML updates - DSC fixes - PSR fixes - DC add some divide by 0 checks - SMU13 updates - SR-IOV fixes - RAS fixes - Cleaner shader support for gfx10.3 dGPUs - fix drm buddy trim handling - SDMA engine reset updates _ Fix RB bitmap setup - Fix doorbell ttm cleanup - Add CEC notifier support - DPIA updates - MST fixes amdkfd: - Shader debugger fixes - Trap handler cleanup - Cleanup includes - Eviction fence wq fix Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250110172731.2960668-1-alexander.deucher@amd.com
2025-01-09drm/amdgpu: fill the ucode bo during psp resume for SRIOVVictor Zhao
refill the ucode bo during psp resume for SRIOV, otherwise ucode load will fail after VM hibernation and fb clean. Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09drm/amdgpu/gfx10: Enable cleaner shader for GFX10.3.2/10.3.4/10.3.5 GPUsSrinivasan Shanmugam
Enable the cleaner shader for GFX10.3.2/10.3.4/10.3.5 GPUs to provide data isolation between GPU workloads. The cleaner shader is responsible for clearing the Local Data Store (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs), which helps prevent data leakage and ensures accurate computation results. This update extends cleaner shader support to GFX10.3.2/10.3.4/10.3.5 GPUs, previously available for GFX10.3.0. It enhances security by clearing GPU memory between processes and maintains a consistent GPU state across KGD and KFD workloads. Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09drm/amdgpu: fix gpu recovery disable with per queue resetJonathan Kim
Per queue reset should be bypassed when gpu recovery is disabled with module parameter. Fixes: ee0a469cf917 ("drm/amdkfd: support per-queue reset on gfx9") Signed-off-by: Jonathan Kim <jonathan.kim@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09drm/amdgpu: wrong array index to get ip block for PSPJiang Liu
The adev->ip_blocks array is not indexed by AMD_IP_BLOCK_TYPE_xxx, instead we should call amdgpu_device_ip_get_ip_block() to get the corresponding IP block oject. Fix some checkpatch issues (Alex) Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09drm/amdgpu: tear down ttm range manager for doorbell in amdgpu_ttm_fini()Jiang Liu
Tear down ttm range manager for doorbell in function amdgpu_ttm_fini(), to avoid memory leakage. Fixes: 792b84fb9038 ("drm/amdgpu: initialize ttm for doorbells") Signed-off-by: Jiang Liu <gerry@linux.alibaba.com> Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09drm/amdgpu: fix incorrect number of active RBs for gfx12Tim Huang
The RB bitmap should be global active RB bitmap & active RB bitmap based on active SA. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09drm/amdgpu: fix incorrect active RB bitmap in setup RBsTim Huang
The RB bitmap width per SA may be 0x1 for some ASICs. Use the actual bitmap of SA instead of 0x3 to determine the active RB bitmap. Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-01-09drm/amdgpu: Fix shift type in amdgpu_debugfs_sdma_sched_mask_set()Dan Carpenter
The "mask" and "val" variables are type u64. The problem is that the BIT() macros are type unsigned long which is just 32 bits on 32bit systems. It's unlikely that people will be using this driver on 32bit kernels and even if they did we only use the lower AMDGPU_MAX_SDMA_INSTANCES (16) bits. So this bug does not affect anything in real life. Still, for correctness sake, u64 bit masks should use BIT_ULL(). Fixes: d2e3961ae371 ("drm/amdgpu: add amdgpu_sdma_sched_mask debugfs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/d39a9325-87a4-4543-b6ec-1c61fca3a6fc@stanley.mountain Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>