summaryrefslogtreecommitdiff
path: root/drivers/gpu/drm/xe
AgeCommit message (Collapse)Author
2024-11-21drm/xe: Mark preempt fence workqueue as reclaimMatthew Brost
Preempt fences are in the path of reclaim, and we signal these fences in the preempt workqueue. With that, we need to mark the preempt fence workqueue with reclaim so that this workqueue can make forward progress during reclaim. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113171751.1677784-1-matthew.brost@intel.com (cherry picked from commit 15cf53ece41748a102f4b5ee26947c2ec059bf95) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2024-11-21drm/xe/ufence: Wake up waiters after setting ufence->signalledNirmoy Das
If a previous ufence is not signalled, vm_bind will return -EBUSY. Delaying the modification of ufence->signalled can cause issues if the UMD reuses the same ufence so update ufence->signalled before waking up waiters. Cc: Matthew Brost <matthew.brost@intel.com> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3233 Fixes: 977e5b82e090 ("drm/xe: Expose user fence from xe_sync_entry") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114150537.4161573-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com> (cherry picked from commit 553a5d14fcd927194c409b10faced6a6dbc678d1) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2024-11-20Merge tag 'asm-generic-3.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic updates from Arnd Bergmann: "These are a number of unrelated cleanups, generally simplifying the architecture specific header files: - A series from Al Viro simplifies asm/vga.h, after it turns out that most of it can be generalized. - A series from Julian Vetter adds a common version of memcpy_{to,from}io() and memset_io() and changes most architectures to use that instead of their own implementation - A series from Niklas Schnelle concludes his work to make PC style inb()/outb() optional - Nicolas Pitre contributes improvements for the generic do_div() helper - Christoph Hellwig adds a generic version of page_to_phys() and phys_to_page(), replacing the slightly different architecture specific definitions. - Uwe Kleine-Koenig has a minor cleanup for ioctl definitions" * tag 'asm-generic-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (24 commits) empty include/asm-generic/vga.h sparc: get rid of asm/vga.h asm/vga.h: don't bother with scr_mem{cpy,move}v() unless we need to vt_buffer.h: get rid of dead code in default scr_...() instances tty: serial: export serial_8250_warn_need_ioport lib/iomem_copy: fix kerneldoc format style hexagon: simplify asm/io.h for !HAS_IOPORT loongarch: Use new fallback IO memcpy/memset csky: Use new fallback IO memcpy/memset arm64: Use new fallback IO memcpy/memset New implementation for IO memcpy and IO memset watchdog: Add HAS_IOPORT dependency for SBC8360 and SBC7240 __arch_xprod64(): make __always_inline when optimizing for performance ARM: div64: improve __arch_xprod_64() asm-generic/div64: optimize/simplify __div64_const32() lib/math/test_div64: add some edge cases relevant to __div64_const32() asm-generic: add an optional pfn_valid check to page_to_phys asm-generic: provide generic page_to_phys and phys_to_page implementations asm-generic/io.h: Remove I/O port accessors for HAS_IOPORT=n tty: serial: handle HAS_IOPORT dependencies ...
2024-11-19drm/xe: drop unused component dependenciesChristian König
XE switched over to drm_exec quite some time ago. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114153020.6209-7-christian.koenig@amd.com
2024-11-18drm/xe: Drop useless d3cold allowed messageMatthew Brost
This message just spams dmesg providing little benefit. Remove it. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241115192155.2050987-1-matthew.brost@intel.com
2024-11-18drm/xe/vram: drop 2G block restrictionMatthew Auld
Currently we limit the max block size for all users to ensure each block can fit within a sg entry (uint). Drop this restriction and tweak the sg construction to instead handle this itself and break down blocks which are too big, if needed. Most users don't need an sg list in the first place. Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Tested-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113172346.256165-2-matthew.auld@intel.com
2024-11-15drm/xe: Split xe_gt_stat.hLucas De Marchi
Follow what's done for the other headers, with the types split into a separate header that can be included by other *_types.h headers. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114152148.572447-5-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-15drm/xe: Drop HAS_HECI_*Lucas De Marchi
Just do the same as for other has_* flags, without a macro. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114152148.572447-4-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-15drm/xe: Include xe_oa_types.hLucas De Marchi
xe_device_types.h and xe_gt_types.h only need to know about the xe_oa struct sizes. Include only the _types.h, like done for other components, and let the full header to be included by the compilation units. Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114152148.572447-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-15drm/xe: Mark preempt fence workqueue as reclaimMatthew Brost
Preempt fences are in the path of reclaim, and we signal these fences in the preempt workqueue. With that, we need to mark the preempt fence workqueue with reclaim so that this workqueue can make forward progress during reclaim. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113171751.1677784-1-matthew.brost@intel.com
2024-11-15drm/xe/ufence: Wake up waiters after setting ufence->signalledNirmoy Das
If a previous ufence is not signalled, vm_bind will return -EBUSY. Delaying the modification of ufence->signalled can cause issues if the UMD reuses the same ufence so update ufence->signalled before waking up waiters. Cc: Matthew Brost <matthew.brost@intel.com> Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3233 Fixes: 977e5b82e090 ("drm/xe: Expose user fence from xe_sync_entry") Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114150537.4161573-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-11-14drm/xe/guc: Remove duplicate source fieldZhanjun Dong
xe_hw_engine_snapshot.source save the information of where data copied from. Because the 'source' field is already populated inside 'matched_node' ptr hanging off xe_devcoredump_snapshot, which happenned either in guc_capture_extract_reglists or xe_engine_manual_capture, we can remove this redundant copy of 'source' from xe_hw_engine_snapshot. Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Changes from prior revs: v2:- Update commit message Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107213841.436384-1-zhanjun.dong@intel.com
2024-11-14drm/{i915, xe}: Move power_domains suspend/resume to display_powerRodrigo Vivi
Move intel_power_domains_{suspend,resume} to inside intel_display_power_{suspend_late, resume_early}. With this also change the VLV suspend failure to call the intel_display_power_resume_early. In the end, the only function executed there for VLV is the intel_power_domains_resume. Besides make the code more consistency give the call that was immediately before: intel_display_power_suspend_late. Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113225016.208673-7-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-11-14drm/xe/display: Delay dsm handler registrationRodrigo Vivi
Bring some consistency to register/unregister order at the same time it aligns with i915 sequence order. Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113225016.208673-6-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-11-14drm/xe/display: Delay hpd_init resumeRodrigo Vivi
Align with i915 and only initialize hotplugs after the display driver access has been resumed. Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113225016.208673-5-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-11-14drm/{i915, xe}/display: Move DP MST calls to display_driverRodrigo Vivi
Move dp_mst suspend/resume functions from the drivers towards intel_display_driver to continue with the unification. Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113225016.208673-4-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-11-14drm/xe: Wire devcoredump to LR TDRMatthew Brost
LR queues can hang, cause engine reset, or cause IOMMU CAT errors. Collect an error capture when this occurs. v2: - s/queue's/queues (Jonathan) v4: - Fix build (CI) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-8-matthew.brost@intel.com
2024-11-14drm/xe: Change xe_engine_snapshot_capture_for_job to be for queueMatthew Brost
During capture time, the target job may be unavailable (e.g., if it's in LR mode). However, the associated exec queue will be available regardless, change xe_engine_snapshot_capture_for_job to take a queue argument ann rename to xe_engine_snapshot_capture_for_queue. v2: - Reword commit message (Jonathan) - Remove redundant queueu check (Zhanjun) - Remove devcoredump job member (Zhanjun) Cc: Zhanjun Dong <zhanjun.dong@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-7-matthew.brost@intel.com
2024-11-14drm/xe: Improve schedule disable response failureMatthew Brost
Print Guc ID and take devcoredump on schedule disable response failure. GuC ID is useful information and a schedule disable response failure is possible the LRC state is corrupted so a devcoredump is helpful to debug. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-6-matthew.brost@intel.com
2024-11-14drm/xe: Add exec queue param to devcoredumpMatthew Brost
During capture time, the target job may be unavailable (e.g., if it's in LR mode). However, the associated exec queue will be available regardless, so add an exec queue param for such cases. v2: - Reword commit message (Jonathan) Cc: Zhanjun Dong <zhanjun.dong@intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-5-matthew.brost@intel.com
2024-11-14drm/xe: Add ring start to LRC snapshotMatthew Brost
Add LRC ring start register to LRC snapshot to verify no LRC register corruption upon hang. This could be possible if the indirect ring state was mapped to user space or via an internal KMD memory corruption. Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-4-matthew.brost@intel.com
2024-11-14drm/xe: Add ring address to LRC snapshotMatthew Brost
The ring is currently in LRC BO but this may change going forward. Include the ring address in the snapshot protecting again any future changes. v2: - s/ring_desc/ring_addr (Jonathan) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-3-matthew.brost@intel.com
2024-11-14drm/xe: Add xe_ring_lrc_is_idle() helperMatthew Brost
Add helper to compare ring head and tail to determine if LRC is idle. v2: - Fix kernel doc (CI, Zhanjun) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241114022522.1951351-2-matthew.brost@intel.com
2024-11-14drm/xe/guc: Fix dereference before NULL checkEverest K.C.
The pointer list->list is dereferenced before the NULL check. Fix this by moving the NULL check outside the for loop, so that the check is performed before the dereferencing. The list->list pointer cannot be NULL so this has no effect on runtime. It's just a correctness issue. This issue was reported by Coverity Scan. https://scan7.scan.coverity.com/#/project-view/51525/11354?selectedIssue=1600335 Fixes: 0f1fdf559225 ("drm/xe/guc: Save manual engine capture into capture list") Signed-off-by: Everest K.C. <everestkc@everestkc.com.np> Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241023233356.5479-1-everestkc@everestkc.com.np (cherry picked from commit 2aff81e039de5b0b7ef6bdcb2c320f121f69e2b4) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2024-11-14drm/xe: Ignore GGTT TLB inval errors during GT resetNirmoy Das
During GT reset, GGTT TLB invalidations may fail. This is acceptable as the reset will clear GGTT caches. Suppress only -ECANCELED other return codes are still unexpected error. v2: Add code comment(Matt). Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3389 Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241112170123.1236443-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-11-14drm/xe: Allow fault injection in vm create and vm bind IOCTLsFrancois Dugast
Use fault injection infrastructure to allow specific functions to be configured over debugfs for failing during the execution of xe_vm_create_ioctl() and xe_vm_bind_ioctl(). This allows more thorough testing from user space by going through code paths for error handling and unwinding which cannot be reached by simply injecting errors in IOCTL arguments. This can help increase code robustness. v2: Add xe_pt_update_ops_{prepare,run} (Matthew Brost) Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241113162212.2154103-1-francois.dugast@intel.com Signed-off-by: Francois Dugast <francois.dugast@intel.com>
2024-11-13drm/xe/oa: Fix "Missing outer runtime PM protection" warningAshutosh Dixit
Fix the following drm_WARN: [953.586396] xe 0000:00:02.0: [drm] Missing outer runtime PM protection ... <4> [953.587090] ? xe_pm_runtime_get_noresume+0x8d/0xa0 [xe] <4> [953.587208] guc_exec_queue_add_msg+0x28/0x130 [xe] <4> [953.587319] guc_exec_queue_fini+0x3a/0x40 [xe] <4> [953.587425] xe_exec_queue_destroy+0xb3/0xf0 [xe] <4> [953.587515] xe_oa_release+0x9c/0xc0 [xe] Suggested-by: John Harrison <john.c.harrison@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Fixes: e936f885f1e9 ("drm/xe/oa/uapi: Expose OA stream fd") Cc: stable@vger.kernel.org Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241109032003.3093811-1-ashutosh.dixit@intel.com (cherry picked from commit b107c63d2953907908fd0cafb0e543b3c3167b75) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe/oa: Fix "Missing outer runtime PM protection" warningAshutosh Dixit
Fix the following drm_WARN: [953.586396] xe 0000:00:02.0: [drm] Missing outer runtime PM protection ... <4> [953.587090] ? xe_pm_runtime_get_noresume+0x8d/0xa0 [xe] <4> [953.587208] guc_exec_queue_add_msg+0x28/0x130 [xe] <4> [953.587319] guc_exec_queue_fini+0x3a/0x40 [xe] <4> [953.587425] xe_exec_queue_destroy+0xb3/0xf0 [xe] <4> [953.587515] xe_oa_release+0x9c/0xc0 [xe] Suggested-by: John Harrison <john.c.harrison@intel.com> Suggested-by: Matthew Brost <matthew.brost@intel.com> Fixes: e936f885f1e9 ("drm/xe/oa/uapi: Expose OA stream fd") Cc: stable@vger.kernel.org Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241109032003.3093811-1-ashutosh.dixit@intel.com
2024-11-13drm/xe: Sample gpu timestamp closer to exec queuesLucas De Marchi
Move the force_wake_get to the beginning of the function so the gpu timestamp can be closer to the sampled value for exec queues. This avoids additional delays waiting for force wake ack which can make the proportion between cycles/total_cycles fluctuate around the real value. For a gputop-like application getting 2 samples to calculate the utilization: sample 0: read_exec_queue_timestamp <<<< (A) read_gpu_timestamp sample 1: read_exec_queue_timestamp <<<<< (B) read_gpu_timestamp In the above case, utilization can be bigger or smaller than it should be, depending on if (A) or (B) receives additional delay, respectively. With this a LNL system that was failing on `xe_drm_fdinfo --r utilization-single-full-load` after ~60 iterations, get to run to 100 without a failure. This is still not perfect, and it's easy to introduce errors by just loading the CPU with `stress --cpu $(nproc)` - the same igt test in this case fails after 2 or 3 iterations. That will be dealt with in the test itself, using a longer sampling period. v2: Rename function and add another to get "any engine", preparing for caching the hwe in future (Umesh / Jonathan) Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108053318.3483678-3-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe: Wait on killed exec queuesLucas De Marchi
When an exec queue is killed it triggers an async process of asking the GuC to schedule the context out. The timestamp in the context image is only updated when this process completes. In case a userspace process kills an exec and tries to read the timestamp, it may not get an updated runtime. Add synchronization between the process reading the fdinfo and the exec queue being killed. After reading all the timestamps, wait on exec queues in the process of being killed. When that wait is over, xe_exec_queue_fini() was already called and updated the timestamps. v2: Do not update pending_removal before validating user args (Matthew Auld) v3: Move wait on pending to be done before getting any timestamp so it's more likely for the gpu and exec queue timestamps to be closer together Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2667 Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108053318.3483678-2-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe: handle flat ccs during hibernation on igpuMatthew Auld
Starting from LNL, CCS has moved over to flat CCS model where there is now dedicated memory reserved for storing compression state. On platforms like LNL this reserved memory lives inside graphics stolen memory, which is not treated like normal RAM and is therefore skipped by the core kernel when creating the hibernation image. Currently if something was compressed and we enter hibernation all the corresponding CCS state is lost on such HW, resulting in corrupted memory. To fix this evict user buffers from TT -> SYSTEM to ensure we take a snapshot of the raw CCS state when entering hibernation, where upon resuming we can restore the raw CCS state back when next validating the buffer. This has been confirmed to fix display corruption on LNL when coming back from hibernation. Fixes: cbdc52c11c9b ("drm/xe/xe2: Support flat ccs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3409 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241112162827.116523-2-matthew.auld@intel.com (cherry picked from commit c8b3c6db941299d7cc31bd9befed3518fdebaf68) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe: improve hibernation on igpuMatthew Auld
The GGTT looks to be stored inside stolen memory on igpu which is not treated as normal RAM. The core kernel skips this memory range when creating the hibernation image, therefore when coming back from hibernation the GGTT programming is lost. This seems to cause issues with broken resume where GuC FW fails to load: [drm] *ERROR* GT0: load failed: status = 0x400000A0, time = 10ms, freq = 1250MHz (req 1300MHz), done = -1 [drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01 [drm] *ERROR* GT0: firmware signature verification failed [drm] *ERROR* CRITICAL: Xe has declared device 0000:00:02.0 as wedged. Current GGTT users are kernel internal and tracked as pinned, so it should be possible to hook into the existing save/restore logic that we use for dgpu, where the actual evict is skipped but on restore we importantly restore the GGTT programming. This has been confirmed to fix hibernation on at least ADL and MTL, though likely all igpu platforms are affected. This also means we have a hole in our testing, where the existing s4 tests only really test the driver hooks, and don't go as far as actually rebooting and restoring from the hibernation image and in turn powering down RAM (and therefore losing the contents of stolen). v2 (Brost) - Remove extra newline and drop unnecessary parentheses. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3275 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241101170156.213490-2-matthew.auld@intel.com (cherry picked from commit f2a6b8e396666d97ada8e8759dfb6a69d8df6380) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe: Restore system memory GGTT mappingsMatthew Brost
GGTT mappings reside on the device and this state is lost during suspend / d3cold thus this state must be restored resume regardless if the BO is in system memory or VRAM. v2: - Unnecessary parentheses around bo->placements[0] (Checkpatch) Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241031182257.2949579-1-matthew.brost@intel.com (cherry picked from commit a19d1db9a3fa89fabd7c83544b84f393ee9b851f) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe: Ensure all locks released in exec IOCTLMatthew Brost
In couple of places the wrong error handling goto was used to release locks. Fix these to ensure all locks dropped on exec IOCTL errors. Cc: Francois Dugast <francois.dugast@intel.com> Fixes: d16ef1a18e39 ("drm/xe/exec: Switch hw engine group execution mode upon job submission") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Francois Dugast <francois.dugast@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241106224944.30130-1-matthew.brost@intel.com (cherry picked from commit 9e7aacd8402b88394e6a83cb242901fde77a1773) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-13drm/xe: handle flat ccs during hibernation on igpuMatthew Auld
Starting from LNL, CCS has moved over to flat CCS model where there is now dedicated memory reserved for storing compression state. On platforms like LNL this reserved memory lives inside graphics stolen memory, which is not treated like normal RAM and is therefore skipped by the core kernel when creating the hibernation image. Currently if something was compressed and we enter hibernation all the corresponding CCS state is lost on such HW, resulting in corrupted memory. To fix this evict user buffers from TT -> SYSTEM to ensure we take a snapshot of the raw CCS state when entering hibernation, where upon resuming we can restore the raw CCS state back when next validating the buffer. This has been confirmed to fix display corruption on LNL when coming back from hibernation. Fixes: cbdc52c11c9b ("drm/xe/xe2: Support flat ccs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3409 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241112162827.116523-2-matthew.auld@intel.com
2024-11-12drm/xe/guc: Support crash dump notification from GuCJohn Harrison
Add support for the two crash dump notifications from GuC. Either one means GuC is toast, so just capture state trigger a reset. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108212737.2044007-3-John.C.Harrison@Intel.com
2024-11-12drm/xe/guc: Reduce default GuC log verbosityJohn Harrison
Drop the default verbosity from 5 (max) to 3 as the extra verbosity generally doesn't provide anything vitally important but does cause rapid log overflow. Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108212737.2044007-2-John.C.Harrison@Intel.com
2024-11-12drm/xe/gsc: Improve SW proxy error checking and loggingDaniele Ceraolo Spurio
If an error occurs in the GSC<->CSME handshake, the GSC will send a PROXY_END msg to the driver with the status set to an error code. We currently don't check the status when receiving a PROXY_END message and instead check the proxy initialization status in the FWSTS reg; therefore, while still catching any initialization failures, we lose the actual returned error code. This can be easily improved by checking the status value and printing it to dmesg if it's an error. Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Alan Previn <alan.previn.teres.alexis@intel.com> Reviewed-by: Alan Previn <alan.previn.teres.alexis@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241106002402.486700-1-daniele.ceraolospurio@intel.com
2024-11-12drm/xe: Take job list lock in xe_sched_first_pending_jobNirmoy Das
Access to the pending_list should always happens under job_list_lock. Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241105160327.2970277-1-nirmoy.das@intel.com Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-11-12drm/i915/display: pass struct pci_dev * to intel_display_device_probe()Jani Nikula
Convert intel_display_device_probe() to accept struct pci_dev * instead of struct drm_i915_private *. Return struct intel_display * in preparation of allocating the memory of it later. Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/ab4e960e3fff46cbeba185882b1e554f0ccd5877.1731321183.git.jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-11-12drm/i915/display: convert display device identification to struct intel_displayJani Nikula
Convert intel_display_device.[ch] to struct intel_display, including callers, but excluding intel_display_device_probe() which will be handled in follow-up. v2: fix display->drm = display->drm goof-up Reviewed-by: Vinod Govindapillai <vinod.govindapillai@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/865b27b66f599e707081d46fca9f679e19a4e8aa.1731321183.git.jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com>
2024-11-09drm/xe/guc: Do not assert CTB state while sending MMIOTomasz Lis
During VF post-migration recovery, MMIO communication channel to GuC is used, despite CTB channel being enabled. This behavior is rooted in the save-restore architecture specification. Therefore, a VF driver cannot assert that CTB is disabled while sending MMIO messages to GuC. Such assertion needs to be PF only, or be removed. This patch simply removes the assertion. Signed-off-by: Tomasz Lis <tomasz.lis@intel.com> Suggested-by: Michał Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107151357.1623733-2-tomasz.lis@intel.com
2024-11-09drm/xe/guc: Prefer GT oriented logs in submit codeMichal Wajdeczko
For better diagnostics, use xe_gt_err() instead of drm_err(). Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107194741.2167-3-michal.wajdeczko@intel.com
2024-11-09drm/xe/guc: Prefer GT oriented asserts in submit codeMichal Wajdeczko
For better diagnostics, use xe_gt_assert() instead of xe_assert(). Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107194741.2167-2-michal.wajdeczko@intel.com
2024-11-08drm/xe: improve hibernation on igpuMatthew Auld
The GGTT looks to be stored inside stolen memory on igpu which is not treated as normal RAM. The core kernel skips this memory range when creating the hibernation image, therefore when coming back from hibernation the GGTT programming is lost. This seems to cause issues with broken resume where GuC FW fails to load: [drm] *ERROR* GT0: load failed: status = 0x400000A0, time = 10ms, freq = 1250MHz (req 1300MHz), done = -1 [drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01 [drm] *ERROR* GT0: firmware signature verification failed [drm] *ERROR* CRITICAL: Xe has declared device 0000:00:02.0 as wedged. Current GGTT users are kernel internal and tracked as pinned, so it should be possible to hook into the existing save/restore logic that we use for dgpu, where the actual evict is skipped but on restore we importantly restore the GGTT programming. This has been confirmed to fix hibernation on at least ADL and MTL, though likely all igpu platforms are affected. This also means we have a hole in our testing, where the existing s4 tests only really test the driver hooks, and don't go as far as actually rebooting and restoring from the hibernation image and in turn powering down RAM (and therefore losing the contents of stolen). v2 (Brost) - Remove extra newline and drop unnecessary parentheses. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3275 Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241101170156.213490-2-matthew.auld@intel.com
2024-11-08drm/xe: Add gt_id to xe_sched_job tracesLucas De Marchi
In order to uniquely identify the jobs, xe_sched_job tracepoints need to have the tuple (gt_id, guc_id). Find a "hole" where the next entry is unaligned and add one more field. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241107060606.3130885-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-08drm/xe: Mimic i915 behavior for non-sleeping MMIO waitGustavo Sousa
In upcoming display changes, we will modify the DMC wakelock MMIO waiting code to choose a non-sleeping variant implementation, because the wakelock is also taking in atomic context. While xe provides an explicit parameter (namely "atomic") to prevent xe_mmio_wait32() from sleeping, i915 does not and implements that behavior when slow_timeout_ms is zero. So, for now, let's mimic what i915 does to allow for display to use non-sleeping MMIO wait. In the future, we should come up with a better and explicit interface for this behavior in i915, at least while display code is not an independent entity with proper interfaces between xe and i915. v2: - Make the tone in comment the comment added in __intel_wait_for_register() more explanatory than a FIXME-like text. (Luca) Reviewed-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Gustavo Sousa <gustavo.sousa@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241108130218.24125-3-gustavo.sousa@intel.com
2024-11-08drm/xe/pf: Adjust scheduling priority based on policy changeMichal Wajdeczko
Local values of scheduling priorities need to be adjusted when changing the schedule_if_idle policy as those are related. Disabling schedule_if_idle policy forces the low priority, and enabling schedule_if_idle policy forces the normal priority. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Lukasz Laguna <lukasz.laguna@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241106151301.2079-6-michal.wajdeczko@intel.com
2024-11-08drm/xe/pf: Allow to control scheduling priority using debugfsMichal Wajdeczko
Add 'sched_priority' attribute that will represents the scheduling prority of the VF or PF. Values {0,1,2} correspond to priorities {LOW,NORMAL,HIGH} respectively. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Lukasz Laguna <lukasz.laguna@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241106151301.2079-5-michal.wajdeczko@intel.com
2024-11-08drm/xe/pf: Add functions to configure VF scheduling priorityMichal Wajdeczko
Add functions to configure PF or VF scheduling priority using the VF_CFG_SCHED_PRIORITY KLV. Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Reviewed-by: Lukasz Laguna <lukasz.laguna@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20241106151301.2079-4-michal.wajdeczko@intel.com