|
If the MEC firmware supports chaining runlists of XNACK+/XNACK-
processes, set the SQ_CONFIG1 chicken bit and SET_RESOURCES bit 28.
When the MEC/HWS supports it, KFD checks whether XNACK+ and XNACK-
processes are mixed. If they are, it enters over-subscription.
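The mix check can be pictured with a minimal sketch. It assumes each process is reduced to a single boolean XNACK mode; `xnack_modes_mixed` and its parameters are hypothetical names for illustration, not KFD symbols.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true when the set contains both
 * XNACK+ (true) and XNACK- (false) processes.
 */
static bool xnack_modes_mixed(const bool *xnack_enabled, size_t count)
{
	bool seen_on = false;
	bool seen_off = false;

	for (size_t i = 0; i < count; i++) {
		if (xnack_enabled[i])
			seen_on = true;
		else
			seen_off = true;
	}

	/* Only a set with both modes present counts as mixed. */
	return seen_on && seen_off;
}
```

A caller would enter over-subscription only when this returns true and the firmware advertises runlist chaining.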
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
On VCN v4.0.5 there is a race condition where the WPTR is not
updated after starting from idle when the doorbell is used. Add a
register read-back after the writes at the end of the function to
ensure all register writes have completed before they are used.
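The read-back pattern can be sketched in plain C. This is a userspace illustration only: `fake_regs`, the `REG_*` indices, `reg_write`, `reg_read` and `ring_start` are hypothetical stand-ins, not the driver's register accessors, and an ordinary array cannot reproduce posted-write behaviour. It only shows where the read sits.

```c
#include <stdint.h>

/* Hypothetical register file standing in for the device MMIO space. */
enum { REG_RB_BASE, REG_RB_SIZE, REG_WPTR, REG_COUNT };
static volatile uint32_t fake_regs[REG_COUNT];

static void reg_write(unsigned int reg, uint32_t val)
{
	fake_regs[reg] = val;
}

static uint32_t reg_read(unsigned int reg)
{
	return fake_regs[reg];
}

/*
 * Program the ring registers, then read one of them back.
 * On a PCIe device a read is ordered behind earlier posted writes,
 * so the read-back acts as a flush for the writes above it.
 */
static uint32_t ring_start(uint32_t base, uint32_t size, uint32_t wptr)
{
	reg_write(REG_RB_BASE, base);
	reg_write(REG_RB_SIZE, size);
	reg_write(REG_WPTR, wptr);

	/* Read-back at function end; the value is returned only for the demo. */
	return reg_read(REG_WPTR);
}
```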
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 07c9db090b86e5211188e1b351303fbc673378cf)
Cc: stable@vger.kernel.org
|
|
Expose a debugfs file node so that user space can dump the IFWI image
on spirom.
For one transaction between PSP and host, the images on both the active
and inactive partitions are read out, so a buffer twice the size of the
maximum IFWI image (currently 16MByte) is needed.
v2: move the vbios gfl macros to the common header and rename the
bo triplet struct to spirom_bo for this specific usage (Hawking)
v3: return directly the result of last command execution (Lijo)
Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The userq resource is already freed in the early phase of
drm_release, so avoid freeing it again in the later KMS
postclose callback.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
A circular locking dependency was detected between the global
`adev->userq_mutex` and per-file `userq_mgr->userq_mutex` when
creating user queues. The issue occurs because:
1. `amdgpu_userq_suspend()` and `amdgpu_userq_resume()` take `adev->userq_mutex` first, then
`userq_mgr->userq_mutex`
2. `amdgpu_userq_create()` takes them in the reverse order
This patch resolves the issue by:
1. Moving the `adev->userq_mutex` lock earlier in `amdgpu_userq_create()`
to cover the `amdgpu_userq_ensure_ev_fence()` call
2. Releasing it after we're done with both queue creation and the
scheduling halt check
v2: remove unused adev->userq_mutex lock (Prike)
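The ordering rule can be sketched with two POSIX mutexes. This is a simplified userspace illustration, assuming `dev_lock` stands in for `adev->userq_mutex` and `mgr_lock` for `userq_mgr->userq_mutex`; the function names and counters are hypothetical and none of this is the driver code.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the device-wide and per-file mutexes. */
static pthread_mutex_t dev_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mgr_lock = PTHREAD_MUTEX_INITIALIZER;
static int queues_created;
static int suspend_count;

/* Suspend/resume path: device lock first, then the per-file lock. */
static void suspend_path(void)
{
	pthread_mutex_lock(&dev_lock);
	pthread_mutex_lock(&mgr_lock);
	suspend_count++;
	pthread_mutex_unlock(&mgr_lock);
	pthread_mutex_unlock(&dev_lock);
}

/*
 * Create path after the fix: the device lock is taken before any
 * work that needs the per-file lock, matching suspend_path().
 */
static void create_path(void)
{
	pthread_mutex_lock(&dev_lock);
	pthread_mutex_lock(&mgr_lock);
	queues_created++;
	pthread_mutex_unlock(&mgr_lock);
	pthread_mutex_unlock(&dev_lock);
}
```

Because both paths acquire the locks in the same order, neither can hold one lock while waiting for a thread that holds the other.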
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Reviewed-by: Prike Liang <Prike.Liang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
On VCN v4.0.5 there is a race condition where the WPTR is not
updated after starting from idle when the doorbell is used. Add a
register read-back after the writes at the end of the function to
ensure all register writes have completed before they are used.
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
On GFX1151, the reported MALL cache size reflects only
half of its actual size; this adjustment corrects the discrepancy.
Signed-off-by: Tim Huang <tim.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 0a5c060b593ad152318f89e5564bfdfcff8a6ac0)
Cc: stable@vger.kernel.org
|
|
When a process exits and the driver unmaps the CSA and frees the GPU VM,
a pending signal can interrupt the wait for the VM lock and cause an
early return, which leaks memory and triggers the warning backtrace below.
Use an uninterruptible lock wait to fix the issue.
WARNING: CPU: 69 PID: 167800 at amd/amdgpu/amdgpu_kms.c:1525
amdgpu_driver_postclose_kms+0x294/0x2a0 [amdgpu]
Call Trace:
<TASK>
drm_file_free.part.0+0x1da/0x230 [drm]
drm_close_helper.isra.0+0x65/0x70 [drm]
drm_release+0x6a/0x120 [drm]
amdgpu_drm_release+0x51/0x60 [amdgpu]
__fput+0x9f/0x280
____fput+0xe/0x20
task_work_run+0x67/0xa0
do_exit+0x217/0x3c0
do_group_exit+0x3b/0xb0
get_signal+0x14a/0x8d0
arch_do_signal_or_restart+0xde/0x100
exit_to_user_mode_loop+0xc1/0x1a0
exit_to_user_mode_prepare+0xf4/0x100
syscall_exit_to_user_mode+0x17/0x40
do_syscall_64+0x69/0xc0
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 7dbbfb3c171a6f63b01165958629c9c26abf38ab)
Cc: stable@vger.kernel.org
|
|
Fix DEBUG_LOCKS_WARN_ON(lock->magic != lock) warning logs.
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The ttm_bo_pin and ttm_bo_unpin warnings are resolved by moving the
doorbell bo reserve up before pin/unpin.
WARNING: CPU: 11 PID: 1818 at drivers/gpu/drm/ttm/ttm_bo.c:592 ttm_bo_pin+0x1f6/0x270 [ttm]
[ +0.000277] CPU: 11 UID: 1000 PID: 1818 Comm: Xwayland Tainted: G W 6.12.0+ #15
[ +0.000006] Tainted: [W]=WARN
[ +0.000004] Hardware name: ASUS System Product Name/TUF GAMING B650-PLUS, BIOS 3072 12/20/2024
[ +0.000004] RIP: 0010:ttm_bo_pin+0x1f6/0x270 [ttm]
[ +0.000005] RSP: 0018:ffff88846ca879d0 EFLAGS: 00010246
[ +0.000007] RAX: 0000000000000000 RBX: ffff88810b7ca848 RCX: 0000000000000000
[ +0.000004] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ +0.000005] RBP: ffff88846ca879e8 R08: 0000000000000000 R09: 0000000000000000
[ +0.000004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810b7ca848
[ +0.000004] R13: ffff88846c666250 R14: 1ffff1108d950f44 R15: ffff88846ca87aa0
[ +0.000005] FS: 00007c45ff436d00(0000) GS:ffff888409580000(0000) knlGS:0000000000000000
[ +0.000004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000005] CR2: 00005b0c142a60e0 CR3: 000000012ce5a000 CR4: 0000000000f50ef0
[ +0.000004] PKRU: 55555554
[ +0.000004] Call Trace:
[ +0.000004] <TASK>
[ +0.000005] ? show_regs+0x6c/0x80
[ +0.000007] ? __warn+0xd2/0x2d0
[ +0.000007] ? ttm_bo_pin+0x1f6/0x270 [ttm]
[ +0.000031] ? report_bug+0x282/0x2f0
[ +0.000012] ? handle_bug+0x6e/0xc0
[ +0.000007] ? exc_invalid_op+0x18/0x50
[ +0.000007] ? asm_exc_invalid_op+0x1b/0x20
[ +0.000017] ? ttm_bo_pin+0x1f6/0x270 [ttm]
[ +0.000014] amdgpu_bo_pin+0x365/0x9d0 [amdgpu]
[ +0.000191] ? __pfx_amdgpu_bo_pin+0x10/0x10 [amdgpu]
[ +0.000185] ? drm_gem_object_lookup+0x81/0xc0
[ +0.000008] ? kasan_save_alloc_info+0x37/0x60
[ +0.000007] ? __kasan_kmalloc+0xc3/0xd0
[ +0.000013] amdgpu_userqueue_get_doorbell_index+0xee/0x5f0 [amdgpu]
[ +0.000209] amdgpu_userq_ioctl+0x6b4/0xd40 [amdgpu]
[ +0.000193] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[ +0.000211] ? lock_acquire+0x7c/0xc0
[ +0.000006] ? drm_dev_enter+0x51/0x190
[ +0.000015] drm_ioctl_kernel+0x18b/0x330
[ +0.000007] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[ +0.000190] ? __pfx_drm_ioctl_kernel+0x10/0x10
[ +0.000005] ? lock_acquire+0x7c/0xc0
[ +0.000009] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? __kasan_check_write+0x14/0x30
[ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000011] drm_ioctl+0x589/0xd00
[ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000006] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[ +0.000194] ? __pfx_drm_ioctl+0x10/0x10
[ +0.000006] ? __pm_runtime_resume+0x80/0x110
[ +0.000021] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? trace_hardirqs_on+0x53/0x60
[ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? _raw_spin_unlock_irqrestore+0x51/0x80
[ +0.000013] amdgpu_drm_ioctl+0xd2/0x1c0 [amdgpu]
[ +0.000185] __x64_sys_ioctl+0x13a/0x1c0
[ +0.000010] x64_sys_call+0x11ad/0x25f0
[ +0.000007] do_syscall_64+0x91/0x180
[ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? irqentry_exit+0x77/0xb0
[ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? exc_page_fault+0x93/0x150
[ +0.000009] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ +0.000005] RIP: 0033:0x7c45ff924ded
[ +0.000005] RSP: 002b:00007ffff7167810 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ +0.000008] RAX: ffffffffffffffda RBX: 00000000c0486456 RCX: 00007c45ff924ded
[ +0.000004] RDX: 00007ffff7167870 RSI: 00000000c0486456 RDI: 000000000000000b
[ +0.000004] RBP: 00007ffff7167860 R08: ffff800100000000 R09: 0000000000010000
[ +0.000005] R10: 00007ffff7167950 R11: 0000000000000246 R12: 00005b0c2a51bc48
[ +0.000004] R13: 000000000000000b R14: 0000000000000000 R15: 00007ffff7167950
[ +0.000022] </TASK>
[ +0.000004] irq event stamp: 80693
[ +0.000004] hardirqs last enabled at (80699): [<ffffffff86a693a9>] __up_console_sem+0x79/0xa0
[ +0.000005] hardirqs last disabled at (80704): [<ffffffff86a6938e>] __up_console_sem+0x5e/0xa0
[ +0.000005] softirqs last enabled at (80390): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[ +0.000005] softirqs last disabled at (80385): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[ +0.000006] ---[ end trace 0000000000000000 ]---
------------------------------------------------------------------------------------------------------
[ +0.000006] WARNING: CPU: 10 PID: 1818 at drivers/gpu/drm/ttm/ttm_bo.c:611 ttm_bo_unpin+0x21f/0x2c0 [ttm]
[ +0.000280] CPU: 10 UID: 1000 PID: 1818 Comm: Xwayland Tainted: G W 6.12.0+ #15
[ +0.000006] Tainted: [W]=WARN
[ +0.000004] Hardware name: ASUS System Product Name/TUF GAMING B650-PLUS, BIOS 3072 12/20/2024
[ +0.000004] RIP: 0010:ttm_bo_unpin+0x21f/0x2c0 [ttm]
[ +0.000005] RSP: 0018:ffff88846ca87888 EFLAGS: 00010246
[ +0.000007] RAX: 0000000000000000 RBX: ffff88810b7ca848 RCX: 0000000000000000
[ +0.000005] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ +0.000004] RBP: ffff88846ca878a0 R08: 0000000000000000 R09: 0000000000000000
[ +0.000004] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888164e90050
[ +0.000005] R13: ffff88846c666200 R14: 0000000000000001 R15: ffff888168402d28
[ +0.000004] FS: 00007c45ff436d00(0000) GS:ffff888409500000(0000) knlGS:0000000000000000
[ +0.000005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000004] CR2: 00007c45f7373b20 CR3: 000000012ce5a000 CR4: 0000000000f50ef0
[ +0.000005] PKRU: 55555554
[ +0.000004] Call Trace:
[ +0.000004] <TASK>
[ +0.000005] ? show_regs+0x6c/0x80
[ +0.000008] ? __warn+0xd2/0x2d0
[ +0.000007] ? ttm_bo_unpin+0x21f/0x2c0 [ttm]
[ +0.000012] ? report_bug+0x282/0x2f0
[ +0.000013] ? handle_bug+0x6e/0xc0
[ +0.000006] ? exc_invalid_op+0x18/0x50
[ +0.000008] ? asm_exc_invalid_op+0x1b/0x20
[ +0.000017] ? ttm_bo_unpin+0x21f/0x2c0 [ttm]
[ +0.000011] ? ttm_bo_unpin+0x217/0x2c0 [ttm]
[ +0.000011] amdgpu_bo_unpin+0x45/0x250 [amdgpu]
[ +0.000216] amdgpu_userq_ioctl+0x2c3/0xd40 [amdgpu]
[ +0.000226] ? drm_dev_exit+0x2d/0x60
[ +0.000010] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[ +0.000201] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? lock_acquire+0x7c/0xc0
[ +0.000006] ? drm_dev_enter+0x51/0x190
[ +0.000015] drm_ioctl_kernel+0x18b/0x330
[ +0.000007] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[ +0.000188] ? __pfx_drm_ioctl_kernel+0x10/0x10
[ +0.000006] ? lock_acquire+0x7c/0xc0
[ +0.000008] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? __kasan_check_write+0x14/0x30
[ +0.000006] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000010] drm_ioctl+0x589/0xd00
[ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000006] ? __pfx_amdgpu_userq_ioctl+0x10/0x10 [amdgpu]
[ +0.000211] ? __pfx_drm_ioctl+0x10/0x10
[ +0.000006] ? __pm_runtime_resume+0x80/0x110
[ +0.000020] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000006] ? trace_hardirqs_on+0x53/0x60
[ +0.000005] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? _raw_spin_unlock_irqrestore+0x51/0x80
[ +0.000013] amdgpu_drm_ioctl+0xd2/0x1c0 [amdgpu]
[ +0.000186] __x64_sys_ioctl+0x13a/0x1c0
[ +0.000010] x64_sys_call+0x11ad/0x25f0
[ +0.000007] do_syscall_64+0x91/0x180
[ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? do_syscall_64+0x9d/0x180
[ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000010] ? __pfx___rseq_handle_notify_resume+0x10/0x10
[ +0.000005] ? __pfx_blkcg_maybe_throttle_current+0x10/0x10
[ +0.000013] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000009] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000008] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? syscall_exit_to_user_mode+0x95/0x260
[ +0.000008] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? do_syscall_64+0x9d/0x180
[ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? do_syscall_64+0x9d/0x180
[ +0.000011] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000010] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000009] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000008] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? irqentry_exit_to_user_mode+0x8b/0x260
[ +0.000007] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000006] ? irqentry_exit+0x77/0xb0
[ +0.000004] ? srso_alias_return_thunk+0x5/0xfbef5
[ +0.000005] ? exc_page_fault+0x93/0x150
[ +0.000010] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ +0.000005] RIP: 0033:0x7c45ff924ded
[ +0.000005] RSP: 002b:00007ffff7168790 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ +0.000008] RAX: ffffffffffffffda RBX: 00000000c0486456 RCX: 00007c45ff924ded
[ +0.000005] RDX: 00007ffff71687f0 RSI: 00000000c0486456 RDI: 000000000000000b
[ +0.000004] RBP: 00007ffff71687e0 R08: 00005b0c2a49b010 R09: 0000000000000007
[ +0.000004] R10: 00005b0c2a4d7140 R11: 0000000000000246 R12: 000000000000000b
[ +0.000004] R13: 00007c45ff19e5cc R14: 00005b0c2a51c538 R15: 00005b0c2a51bbd8
[ +0.000022] </TASK>
[ +0.000005] irq event stamp: 87419
[ +0.000004] hardirqs last enabled at (87425): [<ffffffff86a693a9>] __up_console_sem+0x79/0xa0
[ +0.000005] hardirqs last disabled at (87430): [<ffffffff86a6938e>] __up_console_sem+0x5e/0xa0
[ +0.000005] softirqs last enabled at (87058): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[ +0.000006] softirqs last disabled at (87053): [<ffffffff8687377e>] __irq_exit_rcu+0x17e/0x1d0
[ +0.000005] ---[ end trace 0000000000000000 ]---
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Fix lockdep warnings.
[ +0.000637] ================================
[ +0.000004] WARNING: inconsistent lock state
[ +0.000004] 6.12.0+ #18 Tainted: G W OE
[ +0.000004] --------------------------------
[ +0.000004] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ +0.000004] Xwayland/1952 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ +0.000005] ffff8884636f4740 (&fence_drv->fence_list_lock){?...}-{2:2}, at: amdgpu_userq_fence_driver_destroy+0xb8/0x540 [amdgpu]
[ +0.000208] {IN-HARDIRQ-W} state was registered at:
[ +0.000004] lock_acquire.part.0+0x116/0x360
[ +0.000005] lock_acquire+0x7c/0xc0
[ +0.000005] _raw_spin_lock+0x2f/0x60
[ +0.000005] amdgpu_userq_fence_driver_process+0x75/0x400 [amdgpu]
[ +0.000185] gfx_v12_0_eop_irq+0x29f/0x420 [amdgpu]
[ +0.000210] amdgpu_irq_dispatch+0x2a4/0x7b0 [amdgpu]
[ +0.000191] amdgpu_ih_process+0x1e1/0x3d0 [amdgpu]
[ +0.000185] amdgpu_irq_handler+0x28/0xc0 [amdgpu]
[ +0.000183] __handle_irq_event_percpu+0x1bb/0x590
[ +0.000005] handle_irq_event+0xab/0x1d0
[ +0.000005] handle_edge_irq+0x1fd/0xc10
[ +0.000005] __common_interrupt+0x83/0x190
[ +0.000004] common_interrupt+0xb1/0xe0
[ +0.000005] asm_common_interrupt+0x27/0x40
[ +0.000004] cpuidle_enter_state+0x2ba/0x530
[ +0.000005] cpuidle_enter+0x4f/0xb0
[ +0.000006] call_cpuidle+0x46/0xd0
[ +0.000005] do_idle+0x367/0x430
[ +0.000004] cpu_startup_entry+0x58/0x70
[ +0.000005] start_secondary+0x224/0x2b0
[ +0.000005] common_startup_64+0x13e/0x141
[ +0.000005] irq event stamp: 88271
[ +0.000004] hardirqs last enabled at (88271): [<ffffffffad9ca7a1>] _raw_spin_unlock_irqrestore+0x51/0x80
[ +0.000005] hardirqs last disabled at (88270): [<ffffffffad9ca424>] _raw_spin_lock_irqsave+0x74/0x80
[ +0.000005] softirqs last enabled at (87858): [<ffffffffaa67377e>] __irq_exit_rcu+0x17e/0x1d0
[ +0.000005] softirqs last disabled at (87849): [<ffffffffaa67377e>] __irq_exit_rcu+0x17e/0x1d0
[ +0.000005]
other info that might help us debug this:
[ +0.000004] Possible unsafe locking scenario:
[ +0.000003] CPU0
[ +0.000004] ----
[ +0.000003] lock(&fence_drv->fence_list_lock);
v2:
Drop fence_list_flags and use xa_lock_irqsave() flags parameter (Christian)
Fix merge conflicts.
Signed-off-by: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This patch enables per-queue and per-pipe reset functionality for
GFX IP v9.5.0 when using MEC firmware version 21 (0x15) or later.
This change:
1. Refactors the pipe reset support check in gfx_v9_4_3_pipe_reset_support()
to use the compute_supported_reset flags instead of hardcoding
version checks.
2. Adds support for GFX9.5.0 (IP 9.5.0) with MEC firmware version >= 21
to enable per-queue and per-pipe reset capabilities.
v2: Replaced mec version check with !!(adev->gfx.compute_supported_reset & AMDGPU_RESET_TYPE_PER_PIPE) (Lijo)
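The capability-flag check from v2 can be sketched as follows. The bit values and the helper names `supported_reset_for_fw` and `pipe_reset_supported` are placeholders chosen for this example; the real flag definitions live in the driver headers and may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions, assumed for illustration only. */
#define RESET_TYPE_PER_QUEUE (1u << 0)
#define RESET_TYPE_PER_PIPE  (1u << 1)

/* Hypothetical setup step: firmware version 21 (0x15) or later enables both resets. */
static uint32_t supported_reset_for_fw(uint32_t mec_fw_version)
{
	uint32_t flags = 0;

	if (mec_fw_version >= 21)
		flags |= RESET_TYPE_PER_QUEUE | RESET_TYPE_PER_PIPE;

	return flags;
}

/* The support check reads the flag instead of repeating the version test. */
static bool pipe_reset_supported(uint32_t supported_reset)
{
	return !!(supported_reset & RESET_TYPE_PER_PIPE);
}
```

Keeping the version comparison in one place means later callers only need to test the flag.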
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Set vram type so we can take different actions according to the type.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Make the code more general so that users don't need to pay attention
to the details of the flip bits setting.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Keep the number of newly added de counts consistent with the
number of newly added error addresses.
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
During driver load, RAS event manager may not be initialized. This will
cause any ATHUB event during driver load to be skipped in dmesg log. Log
the error in dmesg log for easier diagnosis.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This resolves a deadlock between user queue management and GPU reset
paths by enforcing consistent lock ordering.
The deadlock occurred when:
1. Process exit path (amdgpu_userq_mgr_fini) would:
- Take uqm->userq_mutex
- Then try to take adev->userq_mutex for list operations
2. GPU reset path (amdgpu_userq_pre_reset) would:
- Take adev->userq_mutex first (for list traversal)
- Then take uqm->userq_mutex
The solution establishes a strict top-down locking order:
1. Always take adev->userq_mutex before any uqm->userq_mutex
2. Maintain this order consistently across all code paths
Changes made:
- Reordered locking in amdgpu_userq_mgr_fini() to take device lock first
- Kept existing proper order in amdgpu_userq_pre_reset()
- Simplified the fini flow by removing redundant operations
This prevents circular dependencies while maintaining thread safety
during both normal operation and GPU reset scenarios.
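A strict top-down order can also be checked mechanically. The following toy rank tracker is a debugging sketch under assumed names (`lock_rank`, `lock_tracker`, `tracker_acquire`, `tracker_release`); it is not part of the driver and is much simpler than the kernel's lockdep.

```c
#include <stdbool.h>

/* Lower rank must be taken first: device lock before any manager lock. */
enum lock_rank { RANK_DEVICE = 1, RANK_MANAGER = 2 };

#define MAX_HELD 4

struct lock_tracker {
	enum lock_rank held[MAX_HELD];
	int depth;
};

/* Records an acquisition; returns false if it would violate the order. */
static bool tracker_acquire(struct lock_tracker *t, enum lock_rank rank)
{
	if (t->depth >= MAX_HELD)
		return false;
	/* The most recently taken lock must have a strictly lower rank. */
	if (t->depth > 0 && t->held[t->depth - 1] >= rank)
		return false;
	t->held[t->depth++] = rank;
	return true;
}

/* Drops the most recently recorded lock. */
static void tracker_release(struct lock_tracker *t)
{
	if (t->depth > 0)
		t->depth--;
}
```

With this tracker, taking the device rank and then the manager rank succeeds, while the reverse sequence is rejected on the second acquisition.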
Fixes: 4ce60dbada96 ("drm/amdgpu: store userq_managers in a list in adev")
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Fix a kernel panic caused by RAS records exceeding the threshold when the
driver is loaded with an RMA threshold specified (bad_page_threshold=128).
1. Fix the warnings caused by disabling the interrupt source
before it was enabled
2. Fix the kernel panic when xcp sysfs is not initialized and a NULL pointer
is hit during fini
3. Fix the memory leak caused by the device's early exit due to RMA
The first reason:
[ 2744.246650] ------------[ cut here ]------------
[ 2744.246651] WARNING: CPU: 0 PID: 289 at /tmp/amd.BkfTLqYV/amd/amdgpu/amdgpu_irq.c:635 amdgpu_irq_put.cold+0x42/0x6e [amdgpu]
[ 2744.247108] Modules linked in: amdgpu(OE+) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) amddrm_buddy(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay binfmt_misc intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel nls_iso8859_1 kvm rapl isst_if_mbox_pci isst_if_mmio pmt_telemetry pmt_crashlog isst_if_common pmt_class mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
[ 2744.247167] linear mlx5_ib ib_uverbs ib_core ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul ghash_clmulni_intel mlx5_core sysfillrect sysimgblt aesni_intel mlxfw fb_sys_fops psample cec crypto_simd cryptd rc_core i2c_i801 nvme xhci_pci tls intel_pmt drm pci_hyperv_intf nvme_core i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg
[ 2744.247194] CPU: 0 PID: 289 Comm: kworker/0:1 Tainted: G OE 5.15.0-70-generic #77-Ubuntu
[ 2744.247197] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C23.AG.2 11/21/2024
[ 2744.247198] Workqueue: events work_for_cpu_fn
[ 2744.247206] RIP: 0010:amdgpu_irq_put.cold+0x42/0x6e [amdgpu]
[ 2744.247634] Code: 79 7f ff 44 89 ee 48 c7 c7 4d 5a 42 c2 89 55 d4 e8 90 09 bc bf 8b 55 d4 4c 89 e6 4c 89 ff e8 3c 76 7f ff 8b 55 d4 84 c0 75 07 <0f> 0b e9 95 79 7f ff 49 03 5c 24 08 f0 ff 0b 75 13 4c 89 e6 4c 89
[ 2744.247636] RSP: 0018:ffa0000019e27cb0 EFLAGS: 00010246
[ 2744.247639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ff11000150fa87c0
[ 2744.247641] RDX: 0000000000000000 RSI: ffffffffc2222430 RDI: ff1100019f200000
[ 2744.247642] RBP: ffa0000019e27ce0 R08: 0000000000000003 R09: ffffffffffe41a08
[ 2744.247643] R10: 0000000000ffff0a R11: 0000000000000001 R12: ff1100019f22ce60
[ 2744.247644] R13: 0000000000000000 R14: 00000000ffffffea R15: ff1100019f200000
[ 2744.247645] FS: 0000000000000000(0000) GS:ff11007e7e400000(0000) knlGS:0000000000000000
[ 2744.247647] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2744.247649] CR2: 00007f3d2002819c CR3: 0000000006810003 CR4: 0000000000771ef0
[ 2744.247650] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2744.247651] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 2744.247652] PKRU: 55555554
[ 2744.247653] Call Trace:
[ 2744.247654] <TASK>
[ 2744.247656] sdma_v4_4_2_hw_fini+0x7a/0xc0 [amdgpu]
[ 2744.247997] ? vcn_v4_0_3_hw_fini+0x5f/0xa0 [amdgpu]
[ 2744.248336] amdgpu_ip_block_hw_fini+0x31/0x61 [amdgpu]
[ 2744.248776] amdgpu_device_fini_hw+0x3bb/0x47b [amdgpu]
[ 2744.249197] ? blocking_notifier_chain_unregister+0x56/0xb0
[ 2744.249202] amdgpu_driver_unload_kms+0x51/0x60 [amdgpu]
[ 2744.249482] amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
[ 2744.249913] amdgpu_pci_probe+0x23e/0x590 [amdgpu]
[ 2744.250187] local_pci_probe+0x48/0x90
[ 2744.250191] work_for_cpu_fn+0x17/0x30
[ 2744.250196] process_one_work+0x228/0x3d0
[ 2744.250198] worker_thread+0x223/0x420
[ 2744.250200] ? process_one_work+0x3d0/0x3d0
[ 2744.250201] kthread+0x127/0x150
[ 2744.250204] ? set_kthread_struct+0x50/0x50
[ 2744.250207] ret_from_fork+0x1f/0x30
[ 2744.250212] </TASK>
[ 2744.250213] ---[ end trace 488c997a88508bc3 ]---
The second reason:
[ 5139.303446] Memory manager not clean during takedown.
[ 5139.303509] WARNING: CPU: 145 PID: 117699 at drivers/gpu/drm/drm_mm.c:998 drm_mm_takedown+0x27/0x30 [drm]
[ 5139.303542] Modules linked in: amdgpu(OE+) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) amddrm_buddy(OE) amddrm_exec(OE) amd_sched(OE) amdkcl(OE) xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel binfmt_misc kvm nls_iso8859_1 rapl isst_if_mbox_pci pmt_telemetry pmt_crashlog isst_if_mmio pmt_class isst_if_common mei_me mei acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath
[ 5139.303572] linear mlx5_ib ib_uverbs ib_core crct10dif_pclmul ast crc32_pclmul i2c_algo_bit ghash_clmulni_intel aesni_intel crypto_simd drm_vram_helper cryptd drm_ttm_helper mlx5_core ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core mlxfw psample intel_pmt nvme xhci_pci drm tls i2c_i801 pci_hyperv_intf nvme_core i2c_smbus i2c_ismt xhci_pci_renesas wmi pinctrl_emmitsburg [last unloaded: amdkcl]
[ 5139.303588] CPU: 145 PID: 117699 Comm: modprobe Tainted: G U OE 5.15.0-70-generic #77-Ubuntu
[ 5139.303590] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C23.AG.2 11/21/2024
[ 5139.303591] RIP: 0010:drm_mm_takedown+0x27/0x30 [drm]
[ 5139.303605] Code: cc 66 90 0f 1f 44 00 00 48 8b 47 38 48 83 c7 38 48 39 f8 75 05 c3 cc cc cc cc 55 48 c7 c7 18 d0 10 c0 48 89 e5 e8 5a bc c3 c1 <0f> 0b 5d c3 cc cc cc cc 90 0f 1f 44 00 00 55 b9 15 00 00 00 48 89
[ 5139.303607] RSP: 0018:ffa00000325c3940 EFLAGS: 00010286
[ 5139.303608] RAX: 0000000000000000 RBX: ff1100012f5cecb0 RCX: 0000000000000027
[ 5139.303609] RDX: ff11007e7fa60588 RSI: 0000000000000001 RDI: ff11007e7fa60580
[ 5139.303610] RBP: ffa00000325c3940 R08: 0000000000000003 R09: fffffffff00c2b78
[ 5139.303610] R10: 000000000000002b R11: 0000000000000001 R12: ff1100012f5cec00
[ 5139.303611] R13: ff1100012138f068 R14: 0000000000000000 R15: ff1100012f5cec90
[ 5139.303611] FS: 00007f42ffca0000(0000) GS:ff11007e7fa40000(0000) knlGS:0000000000000000
[ 5139.303612] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5139.303613] CR2: 00007f23d945ab68 CR3: 00000001212ce005 CR4: 0000000000771ee0
[ 5139.303614] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5139.303615] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 5139.303615] PKRU: 55555554
[ 5139.303616] Call Trace:
[ 5139.303617] <TASK>
[ 5139.303619] amdttm_range_man_fini_nocheck+0xfe/0x1c0 [amdttm]
[ 5139.303625] amdgpu_ttm_fini+0x2ed/0x390 [amdgpu]
[ 5139.303800] amdgpu_bo_fini+0x27/0xc0 [amdgpu]
[ 5139.303959] gmc_v9_0_sw_fini+0x63/0x90 [amdgpu]
[ 5139.304144] amdgpu_device_fini_sw+0x125/0x6a0 [amdgpu]
[ 5139.304302] amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[ 5139.304455] devm_drm_dev_init_release+0x4a/0x80 [drm]
[ 5139.304472] devm_action_release+0x12/0x20
[ 5139.304476] release_nodes+0x3d/0xb0
[ 5139.304478] devres_release_all+0x9b/0xd0
[ 5139.304480] really_probe+0x11d/0x420
[ 5139.304483] __driver_probe_device+0x119/0x190
[ 5139.304485] driver_probe_device+0x23/0xc0
[ 5139.304487] __driver_attach+0xf7/0x1f0
[ 5139.304489] ? __device_attach_driver+0x140/0x140
[ 5139.304491] bus_for_each_dev+0x7c/0xd0
[ 5139.304493] driver_attach+0x1e/0x30
[ 5139.304494] bus_add_driver+0x148/0x220
[ 5139.304496] driver_register+0x95/0x100
[ 5139.304498] __pci_register_driver+0x68/0x70
[ 5139.304500] amdgpu_init+0xbc/0x1000 [amdgpu]
[ 5139.304655] ? 0xffffffffc0b8f000
[ 5139.304657] do_one_initcall+0x46/0x1e0
[ 5139.304659] ? kmem_cache_alloc_trace+0x19e/0x2e0
[ 5139.304663] do_init_module+0x52/0x260
[ 5139.304665] load_module+0xb2b/0xbc0
[ 5139.304667] __do_sys_finit_module+0xbf/0x120
[ 5139.304669] __x64_sys_finit_module+0x18/0x20
[ 5139.304670] do_syscall_64+0x59/0xc0
[ 5139.304673] ? exit_to_user_mode_prepare+0x37/0xb0
[ 5139.304676] ? syscall_exit_to_user_mode+0x27/0x50
[ 5139.304678] ? __x64_sys_mmap+0x33/0x50
[ 5139.304680] ? do_syscall_64+0x69/0xc0
[ 5139.304681] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 5139.304684] RIP: 0033:0x7f42ffdbf88d
[ 5139.304686] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
[ 5139.304687] RSP: 002b:00007ffcb7427158 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 5139.304688] RAX: ffffffffffffffda RBX: 000055ce8b8f3150 RCX: 00007f42ffdbf88d
[ 5139.304689] RDX: 0000000000000000 RSI: 000055ce8b8f9a70 RDI: 000000000000000a
[ 5139.304690] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000011
[ 5139.304690] R10: 000000000000000a R11: 0000000000000246 R12: 000055ce8b8f9a70
[ 5139.304691] R13: 000055ce8b8f2ec0 R14: 000055ce8b8f2ab0 R15: 000055ce8b8f9aa0
[ 5139.304692] </TASK>
[ 5139.304693] ---[ end trace 8536b052f7883003 ]---
Signed-off-by: Ce Sun <cesun102@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Get RAS retire flip bits for HBM with different types in various NPS modes.
Also set flip row bit and MCA R13 bit in PA in different NPS modes.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The RAS bad page retire flip bits can be set per vram type,
vram vendor and NPS mode.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add the general interface to get flip bits for RAS bad page retirement.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Per UMC address conversion algorithm, the high row bits of UMC MCA
address are changed when they're converted into normalized address
on specific ASICs.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Support new version of HBM.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
There are only MCA records in V3, so there is no need to handle PA records.
Recalculate the value of ras_num_bad_pages when parsing fails and
continue with the remaining records instead of quitting.
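The skip-and-recount behaviour can be sketched as a loop that ignores records that fail to parse. The `ras_record` layout, `parse_record` and `count_bad_pages` are hypothetical and exist only for this illustration; the real record format is not described here.

```c
#include <stddef.h>

/* Hypothetical record: 'valid' models whether parsing succeeds. */
struct ras_record {
	int valid;
	unsigned int pages;
};

/* Returns 0 and fills *pages on success, -1 when the record cannot be parsed. */
static int parse_record(const struct ras_record *rec, unsigned int *pages)
{
	if (!rec->valid)
		return -1;
	*pages = rec->pages;
	return 0;
}

/* Recomputes the bad page total from the records that parse cleanly. */
static unsigned int count_bad_pages(const struct ras_record *recs, size_t n)
{
	unsigned int total = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned int pages;

		/* A failed record is skipped; the loop keeps going. */
		if (parse_record(&recs[i], &pages))
			continue;
		total += pages;
	}

	return total;
}
```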
Signed-off-by: ganglxie <ganglxie@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
On GFX1151, the reported MALL cache size reflects only
half of its actual size; this adjustment corrects the discrepancy.
Signed-off-by: Tim Huang <tim.huang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
To resolve the warning regarding the missing error code 'r' in
amdgpu_userq_wait_ioctl(), assign the value 'r = -EINVAL'.
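The class of warning fixed here can be shown with a small sketch. The function `wait_sketch`, its parameters and the failing condition are assumptions made for the example; the condition in the actual ioctl is not stated above.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical function with a shared cleanup label. After the
 * allocation step succeeds r is 0, so an error branch that jumps to
 * 'out' without assigning r would report success by mistake.
 */
static int wait_sketch(unsigned int num_fences, unsigned int max_fences)
{
	int r = 0;
	int *fences = calloc(max_fences ? max_fences : 1, sizeof(*fences));

	if (!fences)
		return -ENOMEM;

	if (num_fences > max_fences) {
		r = -EINVAL;	/* the assignment the warning asks for */
		goto out;
	}

	/* Normal work would happen here. */
out:
	free(fences);
	return r;
}
```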
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202505080458.rnV8YfiY-lkp@intel.com/
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Arunpravin Paneer Selvam <Arunpravin.PaneerSelvam@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Switch cancel_delayed_work() to cancel_delayed_work_sync() to ensure
the delayed work has finished executing before proceeding with
resource cleanup. This prevents a potential use-after-free or
NULL dereference if the resume_work is still running during finalization.
BUG: kernel NULL pointer dereference, address: 0000000000000140
[ +0.000050] #PF: supervisor read access in kernel mode
[ +0.000019] #PF: error_code(0x0000) - not-present page
[ +0.000021] PGD 0 P4D 0
[ +0.000015] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000021] CPU: 17 UID: 0 PID: 196299 Comm: kworker/17:0 Tainted: G U 6.14.0-org-staging #1
[ +0.000032] Tainted: [U]=USER
[ +0.000015] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F39 03/22/2024
[ +0.000029] Workqueue: events amdgpu_userq_restore_worker [amdgpu]
[ +0.000426] RIP: 0010:drm_exec_lock_obj+0x32/0x210 [drm_exec]
[ +0.000025] Code: e5 41 57 41 56 41 55 49 89 f5 41 54 49 89 fc 48 83 ec 08 4c 8b 77 30 4d 85 f6 0f 85 c0 00 00 00 4c 8d 7f 08 48 39 77 38 74 54 <49> 8b bd f8 00 00 00 4c 89 fe 41 f6 04 24 01 75 3c e8 08 50 bc e0
[ +0.000046] RSP: 0018:ffffab1b04da3ce8 EFLAGS: 00010297
[ +0.000020] RAX: 0000000000000001 RBX: ffff930cc60e4bc0 RCX: 0000000000000000
[ +0.000025] RDX: 0000000000000004 RSI: 0000000000000048 RDI: ffffab1b04da3d88
[ +0.000028] RBP: ffffab1b04da3d10 R08: ffff930cc60e4000 R09: 0000000000000000
[ +0.000022] R10: ffffab1b04da3d18 R11: 0000000000000001 R12: ffffab1b04da3d88
[ +0.000023] R13: 0000000000000048 R14: 0000000000000000 R15: ffffab1b04da3d90
[ +0.000023] FS: 0000000000000000(0000) GS:ffff9313dea80000(0000) knlGS:0000000000000000
[ +0.000024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000021] CR2: 0000000000000140 CR3: 000000018351a000 CR4: 0000000000350ef0
[ +0.000025] Call Trace:
[ +0.000018] <TASK>
[ +0.000015] ? show_regs+0x69/0x80
[ +0.000022] ? __die+0x25/0x70
[ +0.000019] ? page_fault_oops+0x15d/0x510
[ +0.000024] ? do_user_addr_fault+0x312/0x690
[ +0.000024] ? sched_clock_cpu+0x10/0x1a0
[ +0.000028] ? exc_page_fault+0x78/0x1b0
[ +0.000025] ? asm_exc_page_fault+0x27/0x30
[ +0.000024] ? drm_exec_lock_obj+0x32/0x210 [drm_exec]
[ +0.000024] drm_exec_prepare_obj+0x21/0x60 [drm_exec]
[ +0.000021] amdgpu_vm_lock_pd+0x22/0x30 [amdgpu]
[ +0.000266] amdgpu_userq_validate_bos+0x6c/0x320 [amdgpu]
[ +0.000333] amdgpu_userq_restore_worker+0x4a/0x120 [amdgpu]
[ +0.000316] process_one_work+0x189/0x3c0
[ +0.000021] worker_thread+0x2a4/0x3b0
[ +0.000022] kthread+0x109/0x220
[ +0.000018] ? __pfx_worker_thread+0x10/0x10
[ +0.000779] ? _raw_spin_unlock_irq+0x1f/0x40
[ +0.000560] ? __pfx_kthread+0x10/0x10
[ +0.000543] ret_from_fork+0x3c/0x60
[ +0.000507] ? __pfx_kthread+0x10/0x10
[ +0.000515] ret_from_fork_asm+0x1a/0x30
[ +0.000515] </TASK>
v2: Replace cancel_delayed_work() with cancel_delayed_work_sync()
in amdgpu_userq_destroy() and amdgpu_userq_evict().
Cc: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Sunil Khatri <sunil.khatri@amd.com>
Signed-off-by: Arvind Yadav <arvind.yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
When a process exits, the CSA is unmapped and the GPU VM is freed. If a
signal arrives while waiting to take the VM lock, the wait is interrupted
and returns early, which leaks memory and triggers the warning backtrace
below. Use an uninterruptible lock wait to fix the issue.
WARNING: CPU: 69 PID: 167800 at amd/amdgpu/amdgpu_kms.c:1525
amdgpu_driver_postclose_kms+0x294/0x2a0 [amdgpu]
Call Trace:
<TASK>
drm_file_free.part.0+0x1da/0x230 [drm]
drm_close_helper.isra.0+0x65/0x70 [drm]
drm_release+0x6a/0x120 [drm]
amdgpu_drm_release+0x51/0x60 [amdgpu]
__fput+0x9f/0x280
____fput+0xe/0x20
task_work_run+0x67/0xa0
do_exit+0x217/0x3c0
do_group_exit+0x3b/0xb0
get_signal+0x14a/0x8d0
arch_do_signal_or_restart+0xde/0x100
exit_to_user_mode_loop+0xc1/0x1a0
exit_to_user_mode_prepare+0xf4/0x100
syscall_exit_to_user_mode+0x17/0x40
do_syscall_64+0x69/0xc0
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.16-2025-05-09:
amdgpu:
- IPS fixes
- DSC cleanup
- DC Scaling updates
- DC FP fixes
- Fused I2C-over-AUX updates
- SubVP fixes
- Freesync fix
- DMUB AUX fixes
- VCN fix
- Hibernation fixes
- HDP fixes
- DCN 2.1 fixes
- DPIA fixes
- DMUB updates
- Use drm_file_err in amdgpu
- Enforce isolation updates
- Use new dma_fence helpers
- USERQ fixes
- Documentation updates
- Misc code cleanups
- SR-IOV updates
- RAS updates
- PSP 12 cleanups
amdkfd:
- Update error messages for SDMA
- Userptr updates
drm:
- Add drm_file_err function
dma-buf:
- Add a helper to sort and deduplicate dma_fence arrays
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250509230951.3871914-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
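A user-space sketch of the idea, assuming the flush is a register write
followed by a read of any register in the same block to post the write. The
register indices, the fake MMIO array and the helper names are all
hypothetical; the real driver uses its own register accessors:

```c
#include <stdint.h>

/* Hypothetical register indices into a fake MMIO space. */
enum { REG_HDP_FLUSH = 0, REG_HDP_MEMCFG = 1, REG_COUNT = 2 };

static volatile uint32_t fake_mmio[REG_COUNT];
static unsigned int reads_of[REG_COUNT];

void wreg32(unsigned int reg, uint32_t v)
{
	fake_mmio[reg] = v;
}

uint32_t rreg32(unsigned int reg)
{
	reads_of[reg]++;
	return fake_mmio[reg];
}

/* Number of reads issued to a register so far (for inspection only). */
unsigned int read_count(unsigned int reg)
{
	return reads_of[reg];
}

/*
 * Trigger the flush with a write, then post it by reading a different,
 * harmless register rather than reading the flush register back.
 */
void hdp_flush_sketch(void)
{
	wreg32(REG_HDP_FLUSH, 0);
	(void)rreg32(REG_HDP_MEMCFG);
}
```

Any read to the device is enough to force the preceding write out, so the
choice of register only has to avoid side effects.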
Fixes: 689275140cb8 ("drm/amdgpu/hdp7.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dbc064adfcf9095e7d895bea87b2f75c1ab23236)
Cc: stable@vger.kernel.org
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: abe1cbaec6cf ("drm/amdgpu/hdp6.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 84141ff615951359c9a99696fd79a36c465ed847)
Cc: stable@vger.kernel.org
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: f756dbac1ce1 ("drm/amdgpu/hdp5.2: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4a89b7698e771914b4d5b571600c76e2fdcbe2a9)
Cc: stable@vger.kernel.org
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: cf424020e040 ("drm/amdgpu/hdp5.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a5cb344033c7598762e89255e8ff52827abb57a4)
Cc: stable@vger.kernel.org
|
|
All HDP versions except v5.2 use common logic for HDP flush, so use a
generic function. HDP v5.2 forces NO_KIQ logic; revisit it later.
Reapply after fixing up an HDP regression.
v2: merge the fix (Alex)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> (v1)
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: 689275140cb8 ("drm/amdgpu/hdp7.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: abe1cbaec6cf ("drm/amdgpu/hdp6.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
PSP v12 does not support SR-IOV.
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: f756dbac1ce1 ("drm/amdgpu/hdp5.2: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: cf424020e040 ("drm/amdgpu/hdp5.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
APUs don't have a second IH ring, so the re-routing action here is a no-op.
Waiting for the PSP to time out also takes a long time during
initialization. So remove the function in psp v12.
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: c9b8dcabb52a ("drm/amdgpu/hdp4.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5c937b4a6050316af37ef214825b6340b5e9e391)
Cc: stable@vger.kernel.org
|
|
Set the s3/s0ix and s4 flags in the pm notifier so that we can skip
the resource evictions properly in pm prepare based on whether
we are suspending or hibernating. Drop the eviction as processes
are not frozen at this time, so we can end up getting stuck trying
to evict VRAM while applications continue to submit work which
causes the buffers to get pulled back into VRAM.
v2: Move suspend flags out of pm notifier (Mario)
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4178
Fixes: 2965e6355dcd ("drm/amd: Add Suspend/Hibernate notification callback support")
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 06f2dcc241e7e5c681f81fbc46cacdf4bfd7d6d7)
Cc: stable@vger.kernel.org
|
|
This reverts commit 3a9626c816db901def438dc2513622e281186d39.
This breaks S4 because we end up setting the s3/s0ix flags
even when we are entering s4 since prepare is used by both
flows. This causes both the S3/s0ix and s4 flags to be set
which breaks several checks in the driver which assume they
are mutually exclusive.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3634
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ce8f7d95899c2869b47ea6ce0b3e5bf304b2fff4)
Cc: stable@vger.kernel.org
|
|
The VCN1_AON_SOC_ADDRESS_3_0 offset varies across VCN
generations; the issue in vcn4.0.5 is caused by
a different VCN1_AON_SOC_ADDRESS_3_0 offset.
This patch does the following:
1. use the same offset for other VCN generations.
2. use the vcn4.0.5 special offset
3. update vcn_4_0 and vcn_5_0
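The per-generation selection listed above might look roughly like this. The
IP_VER() macro, the function name and both offset values are placeholders
chosen for the sketch, not the real register offsets:

```c
#include <stdint.h>

/* Hypothetical encoding of an IP version as major.minor.rev. */
#define IP_VER(maj, min, rev) (((maj) << 16) | ((min) << 8) | (rev))

/* Placeholder values, NOT the real register offsets. */
#define AON_SOC_ADDRESS_DEFAULT_SKETCH 0x100u
#define AON_SOC_ADDRESS_V4_0_5_SKETCH  0x200u

/* Pick the special offset for vcn4.0.5 and the shared one otherwise. */
uint32_t vcn_aon_soc_offset(uint32_t ip_version)
{
	if (ip_version == IP_VER(4, 0, 5))
		return AON_SOC_ADDRESS_V4_0_5_SKETCH;
	return AON_SOC_ADDRESS_DEFAULT_SKETCH;
}
```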
Acked-by: Saleemkhan Jamadar <saleemkhan.jamadar@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5c89ceda9984498b28716944633a9a01cbb2c90d)
Cc: stable@vger.kernel.org
|
|
If there is a problem requiring a reset of the VCN engine, it is better to
reset the VCN engine rather than the entire GPU.
Add a reset callback for the ring which will stop and start VCN if an
issue happens.
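A simplified model of a per-ring reset callback, assuming the timeout
handler prefers it over a full GPU reset when it is present. All struct and
function names here are hypothetical and only mirror the stop/start shape
described above:

```c
#include <stddef.h>

/* Hypothetical return value meaning "no per-ring reset, escalate". */
#define NEED_FULL_RESET (-1)

struct ring_sketch;

struct ring_funcs_sketch {
	int (*reset)(struct ring_sketch *ring);
};

struct ring_sketch {
	const struct ring_funcs_sketch *funcs;
	int running;
	int restarts;
};

int engine_stop(struct ring_sketch *ring)
{
	ring->running = 0;
	return 0;
}

int engine_start(struct ring_sketch *ring)
{
	ring->running = 1;
	ring->restarts++;
	return 0;
}

/* Reset callback: stop the engine, then start it again. */
int engine_ring_reset(struct ring_sketch *ring)
{
	int r = engine_stop(ring);

	if (r)
		return r;
	return engine_start(ring);
}

const struct ring_funcs_sketch engine_ring_funcs = {
	.reset = engine_ring_reset,
};

/* On a job timeout, try the per-ring reset before anything wider. */
int handle_job_timeout(struct ring_sketch *ring)
{
	if (ring->funcs && ring->funcs->reset)
		return ring->funcs->reset(ring);
	return NEED_FULL_RESET;
}
```

Keeping the reset behind a function pointer lets engines without such a
callback fall through to the wider recovery path unchanged.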
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250506204948.12048-4-mario.limonciello@amd.com
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
If there is a problem requiring a reset of the VCN engine, it is better to
reset the VCN engine rather than the entire GPU.
Add a reset callback for the ring which will stop and start VCN if an
issue happens.
Link: https://lore.kernel.org/r/20250506204948.12048-3-mario.limonciello@amd.com
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
There is a problem occurring on VCN 4.0.5 where in some situations a job
is timing out. This triggers a job timeout which then causes a GPU
reset for recovery. That has exposed a number of issues with GPU reset
that have since been fixed. But also a GPU reset isn't actually needed
for this circumstance. Just restarting the ring is enough.
Add a reset callback for the ring which will stop and start VCN if the
issue happens.
Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12528
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/3909
Link: https://lore.kernel.org/r/20250506204948.12048-2-mario.limonciello@amd.com
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Reading back the remapped HDP flush register seems to cause
problems on some platforms. All we need is a read, so read back
the memcfg register.
Fixes: c9b8dcabb52a ("drm/amdgpu/hdp4.0: do a posting read when flushing HDP")
Reported-by: Alexey Klimov <alexey.klimov@linaro.org>
Link: https://lists.freedesktop.org/archives/amd-gfx/2025-April/123150.html
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4119
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3908
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This reverts commit 18a878fd8aef0ec21648a3782f55a79790cd4073.
Revert this temporarily to make it easier to fix a regression
in the HDP handling.
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Fix the indentation
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c:6992 gfx_v11_ip_dump
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202505071619.7sHTLpNg-lkp@intel.com/
Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
Reviewed-by: Arvind Yadav <Arvind.Yadav@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|