summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-11-27drm/xe: Set XE_BO_FLAG_PINNED in migrate selftest BOsMatthew Brost
We only allow continguous BOs to be vmapped, set XE_BO_FLAG_PINNED on BOs in migrate selftest as this forces continguous BOs and selftest uses vmaps. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126174615.2665852-8-matthew.brost@intel.com
2024-11-27drm/xe: Use ttm_bo_access in xe_vm_snapshot_capture_delayedMatthew Brost
Non-contiguous mapping of BO in VRAM doesn't work, use ttm_bo_access instead. v2: - Fix error handling Fixes: 0eb2a18a8fad ("drm/xe: Implement VM snapshot support for BO's and userptr") Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126174615.2665852-7-matthew.brost@intel.com
2024-11-27drm/xe/display: Update intel_bo_read_from_page to use ttm_bo_accessMatthew Brost
Don't open code vmap of a BO, use ttm_bo_access helper which is safe for non-contiguous BOs and non-visible BOs. Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126174615.2665852-6-matthew.brost@intel.com
2024-11-27drm/xe: Take PM ref in delayed snapshot capture workerMatthew Brost
The delayed snapshot capture worker can access the GPU or VRAM both of which require a PM reference. Take a reference in this worker. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Fixes: 4f04d07c0a94 ("drm/xe: Faster devcoredump") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126174615.2665852-5-matthew.brost@intel.com
2024-11-27drm/xe: Add xe_ttm_access_memoryMatthew Brost
Non-contiguous VRAM cannot easily be mapped in TTM nor can non-visible VRAM easily be accessed. Add xe_ttm_access_memory which hooks into ttm_bo_access to access such memory. v4: - Assert memory access rather than taking RPM ref (Thomas / Auld) - Fix warning on xe_res_cursor.h for non-zero offset (Mika) Reported-by: Christoph Manszewski <christoph.manszewski@intel.com> Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126174615.2665852-4-matthew.brost@intel.com
2024-11-27drm/ttm: Add ttm_bo_accessMatthew Brost
Non-contiguous VRAM cannot easily be mapped in TTM nor can non-visible VRAM easily be accessed. Add ttm_bo_access, which is similar to ttm_bo_vm_access, to access such memory. v4: - Fix checkpatch warnings (CI) v5: - Fix checkpatch warnings (CI) v6: - Fix kernel doc (Auld) v7: - Move ttm_bo_access to ttm_bo_vm.c (Christian) Cc: Christian König <christian.koenig@amd.com> Reported-by: Christoph Manszewski <christoph.manszewski@intel.com> Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Tested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126174615.2665852-3-matthew.brost@intel.com
2024-11-27drm/xe: Add xe_bo_vm_accessMatthew Brost
Add xe_bo_vm_access which is wrapper around ttm_bo_vm_access which takes rpm refs for device access. Suggested-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126174615.2665852-2-matthew.brost@intel.com
2024-11-27drm/xe/migrate: use XE_BO_FLAG_PAGETABLEMatthew Auld
On some HW we want to avoid the host caching PTEs, since access from GPU side can be incoherent. However here the special migrate object is mapping PTEs which are written from the host and potentially cached. Use XE_BO_FLAG_PAGETABLE to ensure that non-cached mapping is used, on platforms where this matters. Fixes: 7a060d786cc1 ("drm/xe/mtl: Map PPGTT as CPU:WC") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126181259.159713-4-matthew.auld@intel.com
2024-11-27drm/xe/migrate: fix pat index usageMatthew Auld
XE_CACHE_WB must be converted into the per-platform pat index for that particular caching mode, otherwise we are just encoding whatever happens to be the value of that enum. Fixes: e8babb280b5e ("drm/xe: Convert multiple bind ops into single job") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Cc: <stable@vger.kernel.org> # v6.12+ Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241126181259.159713-3-matthew.auld@intel.com
2024-11-27drm/xe/xe3lpg: Add Wa_16024792527Apoorva Singh
Force Sampler Tile64 Overfetch via MMIO Signed-off-by: Apoorva Singh <apoorva.singh@intel.com> Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107082158.1436637-1-apoorva.singh@intel.com Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-11-26drm/xe/guc_submit: fix race around suspend_pendingMatthew Auld
Currently in some testcases we can trigger: xe 0000:03:00.0: [drm] Assertion `exec_queue_destroyed(q)` failed! .... WARNING: CPU: 18 PID: 2640 at drivers/gpu/drm/xe/xe_guc_submit.c:1826 xe_guc_sched_done_handler+0xa54/0xef0 [xe] xe 0000:03:00.0: [drm] *ERROR* GT1: DEREGISTER_DONE: Unexpected engine state 0x00a1, guc_id=57 Looking at a snippet of corresponding ftrace for this GuC id we can see: 162.673311: xe_sched_msg_add: dev=0000:03:00.0, gt=1 guc_id=57, opcode=3 162.673317: xe_sched_msg_recv: dev=0000:03:00.0, gt=1 guc_id=57, opcode=3 162.673319: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0x29, flags=0x0 162.674089: xe_exec_queue_kill: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0x29, flags=0x0 162.674108: xe_exec_queue_close: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa9, flags=0x0 162.674488: xe_exec_queue_scheduling_done: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa9, flags=0x0 162.678452: xe_exec_queue_deregister: dev=0000:03:00.0, 1:0x2, gt=1, width=1, guc_id=57, guc_state=0xa1, flags=0x0 It looks like we try to suspend the queue (opcode=3), setting suspend_pending and triggering a disable_scheduling. The user then closes the queue. However the close will also forcefully signal the suspend fence after killing the queue, later when the G2H response for disable_scheduling comes back we have now cleared suspend_pending when signalling the suspend fence, so the disable_scheduling now incorrectly tries to also deregister the queue. This leads to warnings since the queue has yet to even be marked for destruction. We also seem to trigger errors later with trying to double unregister the same queue. To fix this tweak the ordering when handling the response to ensure we don't race with a disable_scheduling that didn't actually intend to perform an unregister. The destruction path should now also correctly wait for any pending_disable before marking as destroyed. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3371 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241122161914.321263-6-matthew.auld@intel.com
2024-11-26drm/xe/guc_submit: fix race around pending_disableMatthew Auld
Currently in some testcases we can trigger: [drm] *ERROR* GT0: SCHED_DONE: Unexpected engine state 0x02b1, guc_id=8, runnable_state=0 [drm] *ERROR* GT0: G2H action 0x1002 failed (-EPROTO) len 3 msg 02 10 00 90 08 00 00 00 00 00 00 00 Looking at a snippet of corresponding ftrace for this GuC id we can see: 498.852891: xe_sched_msg_add: dev=0000:03:00.0, gt=0 guc_id=8, opcode=3 498.854083: xe_sched_msg_recv: dev=0000:03:00.0, gt=0 guc_id=8, opcode=3 498.855389: xe_exec_queue_kill: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x3, flags=0x0 498.855436: xe_exec_queue_lr_cleanup: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x83, flags=0x0 498.856767: xe_exec_queue_close: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x83, flags=0x0 498.862889: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0xa9, flags=0x0 498.863032: xe_exec_queue_scheduling_disable: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b9, flags=0x0 498.875596: xe_exec_queue_scheduling_done: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b9, flags=0x0 498.875604: xe_exec_queue_deregister: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b1, flags=0x0 499.074483: xe_exec_queue_deregister_done: dev=0000:03:00.0, 5:0x1, gt=0, width=1, guc_id=8, guc_state=0x2b1, flags=0x0 This looks to be the two scheduling_disable racing with each other, one from the suspend (opcode=3) and then again during lr cleanup. While those two operations are serialized, the G2H portion is not, therefore when marking the queue as pending_disabled and then firing off the first request, we proceed do the same again, however the first disable response only fires after this which then clears the pending_disabled. At this point the second comes back and is processed, however the pending_disabled is no longer set, hence triggering the warning. To fix this wait for pending_disabled when doing the lr cleanup and calling disable_scheduling_deregister. Also do the same for all other disable_scheduling callers. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3515 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <mattheq.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241122161914.321263-5-matthew.auld@intel.com
2024-11-26drm/xe/trace: improve xe_sched_msg traceMatthew Auld
Also include the gt_id, that way we can ignore duplicate guc_id across different GTs when applying some filtering. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241122161914.321263-4-matthew.auld@intel.com
2024-11-26drm/xe/vram: fix lpfn checkMatthew Auld
Technically we should check the lfpn value and not the place->lpfn, for the case where the allocation itself could be as large as the entire region and not be range based, which might result in incorrectly doing a power-of-two roundup. The allocator itself will already ensure it's contiguous underneath for such an allocation. This shouldn't fix any current usecase, but never the less came up from some internal testing. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Piotr Piórkowski <piotr.piorkowski@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241119101926.190203-2-matthew.auld@intel.com
2024-11-25drm/xe: Update xe2_graphics name stringMatt Roper
Since both Xe2 and Xe3 platforms currently use the same set of graphics IP feature flags, we associate the "graphics_xe2" structure with both IPs. Update the name string on that IP structure to clarify this and avoid confusion as Xe3 platforms start going into public CI. Fixes: 800d75bf20ae ("drm/xe/xe3: Define Xe3 feature flags") Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241125194838.1190599-2-matthew.d.roper@intel.com
2024-11-22drm/xe/guc: Add support for G2G communicationsJohn Harrison
Some features require inter-GuC communication channels on multi-tile devices. So allocate and enable such. v2: Correct use of xe_bo_get/put (review feedback from Matthew Brost) Add extra assert, re-order a calculation for better clarity and add comments to slot calculation (review feedback from Daniele). Also slightly re-work the slot calc to avoid code duplication. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241120000222.204095-3-John.C.Harrison@Intel.com
2024-11-22drm/xe: Allow bo mapping on multiple ggttsNiranjana Vishwanathapura
Make bo->ggtt an array to support bo mapping on multiple ggtts. Add XE_BO_FLAG_GGTTx flags to map the bo on ggtt of tile 'x'. Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241120000222.204095-2-John.C.Harrison@Intel.com
2024-11-22drm/xe/ptl: Add another PTL PCI IDMatt Atwood
An additional pci id has been added to bspec. Bspec: 72574 Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com> Reviewed-by: Jordan Justen <jordan.l.justen@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114234410.145863-1-matthew.s.atwood@intel.com
2024-11-22drm/xe/pf: Drop 2GiB limit of fair LMEM allocationMichal Wajdeczko
Since commit 678ccbf98796 ("drm/xe/vram: drop 2G block restriction") we are able to provision VFs with more than 2GiB. Drop our temporary limit of maximum fair LMEM size that was added just to avoid hitting -EINVAL from auto-provisioning. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241121175754.302-1-michal.wajdeczko@intel.com
2024-11-21drm/xe: Sort again the info flagsLucas De Marchi
Those flags are supposed to be kept sorted alphabetically. Unfortunately it's a constant battle as new flags are added to the end or at random places. Sort it again. v2: Include the other non-has_* 1-bit flags in the sort v3: Add comment to keep flags sorted Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> # v1 Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241120195710.3447100-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-18drm/xe: Drop useless d3cold allowed messageMatthew Brost
This message just spams dmesg providing little benefit. Remove it. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241115192155.2050987-1-matthew.brost@intel.com
2024-11-18drm/xe/vram: drop 2G block restrictionMatthew Auld
Currently we limit the max block size for all users to ensure each block can fit within a sg entry (uint). Drop this restriction and tweak the sg construction to instead handle this itself and break down blocks which are too big, if needed. Most users don't need an sg list in the first place. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Tested-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113172346.256165-2-matthew.auld@intel.com
2024-11-15drm/xe: Split xe_gt_stat.hLucas De Marchi
Follow what's done for the other headers, with the types split into a separate header that can be included by other *_types.h headers. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114152148.572447-5-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-15drm/xe: Drop HAS_HECI_*Lucas De Marchi
Just do the same as for other has_* flags, without a macro. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114152148.572447-4-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-15drm/xe: Include xe_oa_types.hLucas De Marchi
xe_device_types.h and xe_gt_types.h only need to know about the xe_oa struct sizes. Include only the _types.h, like done for other components, and let the full header to be included by the compilation units. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114152148.572447-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-15drm/xe: Mark preempt fence workqueue as reclaimMatthew Brost
Preempt fences are in the path of reclaim, and we signal these fences in the preempt workqueue. With that, we need to mark the preempt fence workqueue with reclaim so that this workqueue can make forward progress during reclaim. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113171751.1677784-1-matthew.brost@intel.com
2024-11-15drm/xe/ufence: Wake up waiters after setting ufence->signalledNirmoy Das
If a previous ufence is not signalled, vm_bind will return -EBUSY. Delaying the modification of ufence->signalled can cause issues if the UMD reuses the same ufence so update ufence->signalled before waking up waiters. Cc: Matthew Brost <matthew.brost@intel.com> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3233 Fixes: 977e5b82e090 ("drm/xe: Expose user fence from xe_sync_entry") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114150537.4161573-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-11-14drm/xe/guc: Remove duplicate source fieldZhanjun Dong
xe_hw_engine_snapshot.source save the information of where data copied from. Because the 'source' field is already populated inside 'matched_node' ptr hanging off xe_devcoredump_snapshot, which happenned either in guc_capture_extract_reglists or xe_engine_manual_capture, we can remove this redundant copy of 'source' from xe_hw_engine_snapshot. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Changes from prior revs: v2:- Update commit message Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107213841.436384-1-zhanjun.dong@intel.com
2024-11-14drm/xe: Wire devcoredump to LR TDRMatthew Brost
LR queues can hang, cause engine reset, or cause IOMMU CAT errors. Collect an error capture when this occurs. v2: - s/queue's/queues (Jonathan) v4: - Fix build (CI) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-8-matthew.brost@intel.com
2024-11-14drm/xe: Change xe_engine_snapshot_capture_for_job to be for queueMatthew Brost
During capture time, the target job may be unavailable (e.g., if it's in LR mode). However, the associated exec queue will be available regardless, change xe_engine_snapshot_capture_for_job to take a queue argument ann rename to xe_engine_snapshot_capture_for_queue. v2: - Reword commit message (Jonathan) - Remove redundant queueu check (Zhanjun) - Remove devcoredump job member (Zhanjun) Cc: Zhanjun Dong <zhanjun.dong@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-7-matthew.brost@intel.com
2024-11-14drm/xe: Improve schedule disable response failureMatthew Brost
Print Guc ID and take devcoredump on schedule disable response failure. GuC ID is useful information and a schedule disable response failure is possible the LRC state is corrupted so a devcoredump is helpful to debug. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-6-matthew.brost@intel.com
2024-11-14drm/xe: Add exec queue param to devcoredumpMatthew Brost
During capture time, the target job may be unavailable (e.g., if it's in LR mode). However, the associated exec queue will be available regardless, so add an exec queue param for such cases. v2: - Reword commit message (Jonathan) Cc: Zhanjun Dong <zhanjun.dong@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-5-matthew.brost@intel.com
2024-11-14drm/xe: Add ring start to LRC snapshotMatthew Brost
Add LRC ring start register to LRC snapshot to verify no LRC register corruption upon hang. This could be possible if the indirect ring state was mapped to user space or via an internal KMD memory corruption. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-4-matthew.brost@intel.com
2024-11-14drm/xe: Add ring address to LRC snapshotMatthew Brost
The ring is currently in LRC BO but this may change going forward. Include the ring address in the snapshot protecting again any future changes. v2: - s/ring_desc/ring_addr (Jonathan) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-3-matthew.brost@intel.com
2024-11-14drm/xe: Add xe_ring_lrc_is_idle() helperMatthew Brost
Add helper to compare ring head and tail to determine if LRC is idle. v2: - Fix kernel doc (CI, Zhanjun) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-2-matthew.brost@intel.com
2024-11-14drm/xe: Ignore GGTT TLB inval errors during GT resetNirmoy Das
During GT reset, GGTT TLB invalidations may fail. This is acceptable as the reset will clear GGTT caches. Suppress only -ECANCELED other return codes are still unexpected error. v2: Add code comment(Matt). Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3389 Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241112170123.1236443-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-11-14drm/xe: Allow fault injection in vm create and vm bind IOCTLsFrancois Dugast
Use fault injection infrastructure to allow specific functions to be configured over debugfs for failing during the execution of xe_vm_create_ioctl() and xe_vm_bind_ioctl(). This allows more thorough testing from user space by going through code paths for error handling and unwinding which cannot be reached by simply injecting errors in IOCTL arguments. This can help increase code robustness. v2: Add xe_pt_update_ops_{prepare,run} (Matthew Brost) Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113162212.2154103-1-francois.dugast@intel.com Signed-off-by: Francois Dugast <francois.dugast@intel.com>
2024-11-13drm/xe/oa: Fix "Missing outer runtime PM protection" warningAshutosh Dixit
Fix the following drm_WARN: [953.586396] xe 0000:00:02.0: [drm] Missing outer runtime PM protection ... <4> [953.587090] ? xe_pm_runtime_get_noresume+0x8d/0xa0 [xe] <4> [953.587208] guc_exec_queue_add_msg+0x28/0x130 [xe] <4> [953.587319] guc_exec_queue_fini+0x3a/0x40 [xe] <4> [953.587425] xe_exec_queue_destroy+0xb3/0xf0 [xe] <4> [953.587515] xe_oa_release+0x9c/0xc0 [xe] Suggested-by: John Harrison <john.c.harrison@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Fixes: e936f885f1e9 ("drm/xe/oa/uapi: Expose OA stream fd") Cc: stable@vger.kernel.org Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241109032003.3093811-1-ashutosh.dixit@intel.com
2024-11-13drm/xe: Sample gpu timestamp closer to exec queuesLucas De Marchi
Move the force_wake_get to the beginning of the function so the gpu timestamp can be closer to the sampled value for exec queues. This avoids additional delays waiting for force wake ack which can make the proportion between cycles/total_cycles fluctuate around the real value. For a gputop-like application getting 2 samples to calculate the utilization: sample 0: read_exec_queue_timestamp <<<< (A) read_gpu_timestamp sample 1: read_exec_queue_timestamp <<<<< (B) read_gpu_timestamp In the above case, utilization can be bigger or smaller than it should be, depending on if (A) or (B) receives additional delay, respectively. With this a LNL system that was failing on `xe_drm_fdinfo --r utilization-single-full-load` after ~60 iterations, get to run to 100 without a failure. This is still not perfect, and it's easy to introduce errors by just loading the CPU with `stress --cpu $(nproc)` - the same igt test in this case fails after 2 or 3 iterations. That will be dealt with in the test itself, using a longer sampling period. v2: Rename function and add another to get "any engine", preparing for caching the hwe in future (Umesh / Jonathan) Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108053318.3483678-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe: Wait on killed exec queuesLucas De Marchi
When an exec queue is killed it triggers an async process of asking the GuC to schedule the context out. The timestamp in the context image is only updated when this process completes. In case a userspace process kills an exec and tries to read the timestamp, it may not get an updated runtime. Add synchronization between the process reading the fdinfo and the exec queue being killed. After reading all the timestamps, wait on exec queues in the process of being killed. When that wait is over, xe_exec_queue_fini() was already called and updated the timestamps. v2: Do not update pending_removal before validating user args (Matthew Auld) v3: Move wait on pending to be done before getting any timestamp so it's more likely for the gpu and exec queue timestamps to be closer together Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667 Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108053318.3483678-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe: handle flat ccs during hibernation on igpuMatthew Auld
Starting from LNL, CCS has moved over to flat CCS model where there is now dedicated memory reserved for storing compression state. On platforms like LNL this reserved memory lives inside graphics stolen memory, which is not treated like normal RAM and is therefore skipped by the core kernel when creating the hibernation image. Currently if something was compressed and we enter hibernation all the corresponding CCS state is lost on such HW, resulting in corrupted memory. To fix this evict user buffers from TT -> SYSTEM to ensure we take a snapshot of the raw CCS state when entering hibernation, where upon resuming we can restore the raw CCS state back when next validating the buffer. This has been confirmed to fix display corruption on LNL when coming back from hibernation. Fixes: cbdc52c11c9b ("drm/xe/xe2: Support flat ccs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3409 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241112162827.116523-2-matthew.auld@intel.com
2024-11-12drm/xe/guc: Support crash dump notification from GuCJohn Harrison
Add support for the two crash dump notifications from GuC. Either one means GuC is toast, so just capture state trigger a reset. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108212737.2044007-3-John.C.Harrison@Intel.com
2024-11-12drm/xe/guc: Reduce default GuC log verbosityJohn Harrison
Drop the default verbosity from 5 (max) to 3 as the extra verbosity generally doesn't provide anything vitally important but does cause rapid log overflow. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108212737.2044007-2-John.C.Harrison@Intel.com
2024-11-12drm/xe/gsc: Improve SW proxy error checking and loggingDaniele Ceraolo Spurio
If an error occurs in the GSC<->CSME handshake, the GSC will send a PROXY_END msg to the driver with the status set to an error code. We currently don't check the status when receiving a PROXY_END message and instead check the proxy initialization status in the FWSTS reg; therefore, while still catching any initialization failures, we lose the actual returned error code. This can be easily improved by checking the status value and printing it to dmesg if it's an error. Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241106002402.486700-1-daniele.ceraolospurio@intel.com
2024-11-12drm/xe: Take job list lock in xe_sched_first_pending_jobNirmoy Das
Access to the pending_list should always happens under job_list_lock. Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241105160327.2970277-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-11-09drm/xe/guc: Do not assert CTB state while sending MMIOTomasz Lis
During VF post-migration recovery, MMIO communication channel to GuC is used, despite CTB channel being enabled. This behavior is rooted in the save-restore architecture specification. Therefore, a VF driver cannot assert that CTB is disabled while sending MMIO messages to GuC. Such assertion needs to be PF only, or be removed. This patch simply removes the assertion. Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Suggested-by: Michał Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107151357.1623733-2-tomasz.lis@intel.com
2024-11-09drm/xe/guc: Prefer GT oriented logs in submit codeMichal Wajdeczko
For better diagnostics, use xe_gt_err() instead of drm_err(). Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107194741.2167-3-michal.wajdeczko@intel.com
2024-11-09drm/xe/guc: Prefer GT oriented asserts in submit codeMichal Wajdeczko
For better diagnostics, use xe_gt_assert() instead of xe_assert(). Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107194741.2167-2-michal.wajdeczko@intel.com
2024-11-08drm/xe: improve hibernation on igpuMatthew Auld
The GGTT looks to be stored inside stolen memory on igpu which is not treated as normal RAM. The core kernel skips this memory range when creating the hibernation image, therefore when coming back from hibernation the GGTT programming is lost. This seems to cause issues with broken resume where GuC FW fails to load: [drm] *ERROR* GT0: load failed: status = 0x400000A0, time = 10ms, freq = 1250MHz (req 1300MHz), done = -1 [drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01 [drm] *ERROR* GT0: firmware signature verification failed [drm] *ERROR* CRITICAL: Xe has declared device 0000:00:02.0 as wedged. Current GGTT users are kernel internal and tracked as pinned, so it should be possible to hook into the existing save/restore logic that we use for dgpu, where the actual evict is skipped but on restore we importantly restore the GGTT programming. This has been confirmed to fix hibernation on at least ADL and MTL, though likely all igpu platforms are affected. This also means we have a hole in our testing, where the existing s4 tests only really test the driver hooks, and don't go as far as actually rebooting and restoring from the hibernation image and in turn powering down RAM (and therefore losing the contents of stolen). v2 (Brost) - Remove extra newline and drop unnecessary parentheses. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3275 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241101170156.213490-2-matthew.auld@intel.com
2024-11-08drm/xe: Add gt_id to xe_sched_job tracesLucas De Marchi
In order to uniquely identify the jobs, xe_sched_job tracepoints need to have the tuple (gt_id, guc_id). Find a "hole" where the next entry is unaligned and add one more field. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107060606.3130885-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>