summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-05-23drm/amdgpu/display: Fix null pointer dereference in ↵Srinivasan Shanmugam
dc_stream_program_cursor_position The fix involves adding a null check for 'stream' at the beginning of the function. If 'stream' is NULL, the function immediately returns false. This ensures that 'stream' is not NULL when we dereference it to access 'ctx' in 'dc = stream->ctx->dc;' the function. Fixes the below: drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_stream.c:398 dc_stream_program_cursor_position() error: we previously assumed 'stream' could be null (see line 397) drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_stream.c 389 bool dc_stream_program_cursor_position( 390 struct dc_stream_state *stream, 391 const struct dc_cursor_position *position) 392 { 393 struct dc *dc; 394 bool reset_idle_optimizations = false; 395 const struct dc_cursor_position *old_position; 396 397 old_position = stream ? &stream->cursor_position : NULL; ^^^^^^^^ The patch adds a NULL check --> 398 dc = stream->ctx->dc; ^^^^^^^^ The old code didn't check 399 400 if (dc_stream_set_cursor_position(stream, position)) { 401 dc_z10_restore(dc); 402 403 /* disable idle optimizations if enabling cursor */ 404 if (dc->idle_optimizations_allowed && 405 (!old_position->enable || dc->debug.exit_idle_opt_for_cursor_updates) && 406 position->enable) { 407 dc_allow_idle_optimizations(dc, false); Fixes: f63f86b5affc ("drm/amd/display: Separate setting and programming of cursor") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Harry Wentland <harry.wentland@amd.com> Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: add gfx queue support for gfx11 ipdumpSunil Khatri
Add support of all the CP GFX queues for gfx11 ipdump to be used by devcoredump. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: add cp queue registers for gfx11 ipdumpSunil Khatri
Add gfx11 support of CP queue registers for all queues to be used by devcoredump. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amd/display: Pass errors from amdgpu_dm_init() upMario Limonciello
Errors in amdgpu_dm_init() are silently ignored and dm_hw_init() will succeed. However often these are fatal errors and it would be better to pass them up. Reviewed-by: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: add print support for gfx11 ipdumpSunil Khatri
Add support of gfx11 ipdump print so devcoredump could trigger it to dump the captured registers in devcoredump. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: add gfx11 registers support in ipdumpSunil Khatri
Add general registers of gfx11 in ipdump for devcoredump support. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: Add missing offsets in gc_11_0_0_offset.hSunil Khatri
IB1 registers: regCP_IB1_CMD_BUFSZ regCP_IB1_BASE_LO regCP_IB1_BASE_HI regCP_IB1_BUFSZ regCP_MES_DEBUG_INTERRUPT_INSTR_PNTR Above registers are part of the asic but not of the offset file for gc_11_0_0_offset.h and hence adding them. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: add more device info to the devcoredumpSunil Khatri
Adding more device information: a. PCI info b. VRAM and GTT info c. GDC config Also correct the print layout and section information for in devcoredump. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: add prints in IP State dumpSunil Khatri
add prints before and after ip state is dumped. It avoids user to think of system being stuck/hung as dump could take some time after a gpu hang. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: add gfx queue support of gfx10 in ipdumpSunil Khatri
Add gfx queue register for all instances in devcoredump for gfx10. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: Add cp queues support fro gfx10 in ipdumpSunil Khatri
Add support to dump registers of all instances of cp queue registers of gfx10 to devcoredump. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: rename the ip_dump to ip_dump_coreSunil Khatri
Rename the memory pointer from ip_dump to ip_dump_core to make it specific to core registers and rest other registers to be dumped in their respective memories. Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: Add CRC16 selection in configLijo Lazar
KFD uses crc16 for gpu_id generation. Fixes: 3ed181b8ff43 ("drm/amdkfd: Ensure gpu_id is unique") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202405211405.TidTWIBX-lkp@intel.com/ Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amd/pm: workaround to pass jpeg unit testKenneth Feng
this is a workaround to pass jpeg unit test on vcn 5.0 now. will be removed later. Signed-off-by: Kenneth Feng <kenneth.feng@amd.com> Reviewed-by: Sonny Jiang <sonny.jiang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amd/amdgpu: fix the inst passed to amdgpu_virt_rlcg_reg_rwVictor Zhao
the inst passed to amdgpu_virt_rlcg_reg_rw should be physical instance. Fix the miss matched code. Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu - optimize rlc spm cntlJane Jian
v1 - driver MMIO read the register to check whether write is required - if write is required, sriov full time to use rlcg, otherwise use KIQ v2 - include gfx v11 sriov runtime case Signed-off-by: Jane Jian <Jane.Jian@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amd/display: Refactor construct_phy function in dc/link/link_factory.cSrinivasan Shanmugam
This commit modifies the construct_phy function to handle the case where `bios->integrated_info` is NULL and to address a compiler warning about a large stack allocation. Upon examination, it was found that the local `integrated_info` structure was just used to copy values which is large and was being declared directly on the stack which could potentially lead to performance issues. This commit changes the code to use `bios->integrated_info` directly, which avoids the need for a large stack allocation. The function now checks if `bios->integrated_info` is NULL before entering a for loop that uses it. If `bios->integrated_info` is NULL, the function skips the for loop and continues executing the rest of the code. This ensures that the function behaves correctly when `bios->integrated_info` is NULL and improves compatibility with dGPUs. Fixes the below with gcc W=1: drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_factory.c: In function ‘construct_phy’: drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_factory.c:743:1: warning: the frame size of 1056 bytes is larger than 1024 bytes [-Wframe-larger-than=] Cc: Wenjing Liu <wenjing.liu@amd.com> Cc: Jerry Zuo <jerry.zuo@amd.com> Cc: Qingqing Zhuo <qingqing.zhuo@amd.com> Cc: Tom Chung <chiahsuan.chung@amd.com> Cc: Alvin Lee <alvin.lee2@amd.com> Cc: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Cc: Roman Li <roman.li@amd.com> Cc: Hersen Wu <hersenxs.wu@amd.com> Cc: Alex Hung <alex.hung@amd.com> Cc: Aurabindo Pillai <aurabindo.pillai@amd.com> Cc: Harry Wentland <harry.wentland@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Suggested-by: Wenjing Liu <wenjing.liu@amd.com> Reviewed-by: Wenjing Liu <wenjing.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: correct hbm field in boot statusHawking Zhang
hbm filed takes bit 13 and bit 14 in boot status. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: program device_cntl2 through pci cfg spaceFrank Min
device_cntl2 is accessible from pci config space, so program it through pci cfg space instead of mmio. Signed-off-by: Frank Min <Frank.Min@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu/atomfirmware: add intergrated info v2.3 tableLi Ma
[Why] The vram width value is 0. Because the integratedsysteminfo table in VBIOS has updated to 2.3. [How] Driver needs a new intergrated info v2.3 table too. Then the vram width value will be correct. Signed-off-by: Li Ma <li.ma@amd.com> Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: Fix snprintf usage in amdgpu_gfx_kiq_init_ringSrinivasan Shanmugam
This commit fixes a format truncation issue arosed by the snprintf function potentially writing more characters into the ring->name buffer than it can hold, in the amdgpu_gfx_kiq_init_ring function The issue occurred because the '%d' format specifier could write between 1 and 10 bytes into a region of size between 0 and 8, depending on the values of xcc_id, ring->me, ring->pipe, and ring->queue. The snprintf function could output between 12 and 41 bytes into a destination of size 16, leading to potential truncation. To resolve this, the snprintf line was modified to use the '%hhu' format specifier for xcc_id, ring->me, ring->pipe, and ring->queue. The '%hhu' specifier is used for unsigned char variables and ensures that these values are printed as unsigned decimal integers. Fixes the below with gcc W=1: drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c: In function ‘amdgpu_gfx_kiq_init_ring’: drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:61: warning: ‘%d’ directive output may be truncated writing between 1 and 10 bytes into a region of size between 0 and 8 [-Wformat-truncation=] 332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d", | ^~ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:50: note: directive argument in the range [0, 2147483647] 332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d", | ^~~~~~~~~~~~~~~~~ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c:332:9: note: ‘snprintf’ output between 12 and 41 bytes into a destination of size 16 332 | snprintf(ring->name, sizeof(ring->name), "kiq_%d.%d.%d.%d", | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 333 | xcc_id, ring->me, ring->pipe, ring->queue); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: 345a36c4f1ba ("drm/amdgpu: prefer snprintf over sprintf") Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: fix invadate operation for pg_flagsJesse Zhang
Since the type of pg_flags is u32, adev->pg_flags >> 16 >> 16 is 0 regardless of the values of its operands. So removing the operations upper_32_bits and lower_32_bits. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Tim Huang <Tim.Huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu/mes12: mes hw_fini fix for mode1 resetJack Xiao
Port mes11 hw_fini to mes12, fix for mode1 reset. Signed-off-by: Jack Xiao <Jack.Xiao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: fix invadate operation for umschJesse Zhang
Since the type of data_size is uint32_t, adev->umsch_mm.data_size - 1 >> 16 >> 16 is 0 regardless of the values of its operands So removing the operations upper_32_bits and lower_32_bits. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Tim Huang <Tim.Huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The major fix here is for a filesystem corruption issue reported on Apple M1 as a result of buggy management of the floating point register state introduced in 6.8. I initially reverted one of the offending patches, but in the end Ard cooked a proper fix so there's a revert+reapply in the series. Aside from that, we've got some CPU errata workarounds and misc other fixes. - Fix broken FP register state tracking which resulted in filesystem corruption when dm-crypt is used - Workarounds for Arm CPU errata affecting the SSBS Spectre mitigation - Fix lockdep assertion in DMC620 memory controller PMU driver - Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is disabled" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64/fpsimd: Avoid erroneous elide of user state reload Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY perf/arm-dmc620: Fix lockdep assert in ->event_init() Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: errata: Add workaround for Arm errata 3194386 and 3312417 arm64: cputype: Add Neoverse-V3 definitions arm64: cputype: Add Cortex-X4 definitions arm64: barrier: Restore spec_bar() macro
2024-05-23drm/admgpu: fix dereferencing null pointer contextJesse Zhang
When user space sets an invalid ta type, the pointer context will be empty. So it need to check the pointer context before using it Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Tim Huang <Tim.Huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amd/pm: fix unsigned value asic_type compared againstJesse Zhang
Enum asic_type always greater than or equal CHIP_TAHITI. Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com> Suggested-by: Tim Huang <Tim.Huang@amd.com> Reviewed-by: Tim Huang <Tim.Huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23drm/amdgpu: skip to create ras xxx_err_count node when ACA is enabledYang Wang
skip to create 'xxx_err_count' node when ACA is enabled. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
2024-05-23irqchip/riscv-imsic: Fixup riscv_ipi_set_virq_range() conflictPalmer Dabbelt
There was a semantic conflict between 21a8f8a0eb35 ("irqchip: Add RISC-V incoming MSI controller early driver") and dc892fb44322 ("riscv: Use IPIs for remote cache/TLB flushes by default") due to an API change. This manifests as a build failure post-merge. Fixes: 0bfbc914d943 ("Merge tag 'riscv-for-linus-6.10-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux") Reported-by: Tomasz Jeznach <tjeznach@rivosinc.com> Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240522184953.28531-3-palmer@rivosinc.com Link: https://lore.kernel.org/all/mhng-10b71228-cf3e-42ca-9abf-5464b15093f1@palmer-ri-x1c9/
2024-05-23drm/xe/guc: Port over the slow GuC loading support from i915John Harrison
GuC loading can take longer than it is supposed to for various reasons. So add in the code to cope with that and to report it when it happens. There are also many different reasons why GuC loading can fail, so add in the code for checking for those and for reporting issues in a meaningful manner rather than just hitting a timeout and saying 'fail: status = %x'. Also, remove the 'FIXME' comment about an i915 bug that has never been applicable to Xe! v2: Actually report the requested and granted frequencies rather than showing granted twice (review feedback from Badal). v3: Locally code all the timeout and end condition handling because a helper function is not allowed (review feedback from Lucas/Rodrigo). v4: Add more documentation comments and rename a define to add units (review feedback from Lucas). v5: Fix copy/paste error in xe_mmio_wait32_not (review feedback from Lucas) and rebase (no more return value from guc_wait_ucode). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240518043700.3264362-3-John.C.Harrison@Intel.com
2024-05-23drm/xe: Make read_perf_limit_reasons globally accessibleJohn Harrison
Other driver code beyond the sysfs interface wants to know about throttling. So make the query function globally accessible. v2: Revert include order change (review feedback from Lucas) v3: Remove '_sysfs' from throttle file names and keep limit query in the same file rather than moving elsewhere (review feedback from Rodrigo). v4: Correct #include while renaming header file (review feedback from Lucas). Signed-off-by: John Harrison <John.C.Harrison@Intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240518043700.3264362-2-John.C.Harrison@Intel.com
2024-05-23drm/xe: Nuke simple error captureJosé Roberto de Souza
This error capture prints into dmesg HW state when a gpu hang happens. It was useful when we did not had devcoredump, now it is a incompleted version of devcoredump that has potential to flood dmesg. Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: John Harrison <John.C.Harrison@Intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Reviewed-by: John Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522203431.191594-1-jose.souza@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-05-23drm/xe: Add process name to devcoredumpJosé Roberto de Souza
Process name help us track what application caused the gpug hang, this is crucial when running several applications at the same time. v2: - handle Xe KMD exec_queues without VM v3: - use get_pid_task() (suggested by Nirmoy) Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Nirmoy Das <nirmoy.das@intel.com> Signed-off-by: José Roberto de Souza <jose.souza@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522201203.145403-1-jose.souza@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-05-23drm/xe: remove unused struct 'xe_gt_desc'Dr. David Alan Gilbert
'xe_gt_desc' is unused since commit 1e6c20be6c83 ("drm/xe: Drop extra_gts[] declarations and XE_GT_TYPE_REMOTE"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patchwork.freedesktop.org/patch/msgid/20240522175840.382107-1-linux@treblig.org Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-05-23KVM: SVM: WARN on vNMI + NMI window iff NMIs are outright maskedSean Christopherson
When requesting an NMI window, WARN on vNMI support being enabled if and only if NMIs are actually masked, i.e. if the vCPU is already handling an NMI. KVM's ABI for NMIs that arrive simultanesouly (from KVM's point of view) is to inject one NMI and pend the other. When using vNMI, KVM pends the second NMI simply by setting V_NMI_PENDING, and lets the CPU do the rest (hardware automatically sets V_NMI_BLOCKING when an NMI is injected). However, if KVM can't immediately inject an NMI, e.g. because the vCPU is in an STI shadow or is running with GIF=0, then KVM will request an NMI window and trigger the WARN (but still function correctly). Whether or not the GIF=0 case makes sense is debatable, as the intent of KVM's behavior is to provide functionality that is as close to real hardware as possible. E.g. if two NMIs are sent in quick succession, the probability of both NMIs arriving in an STI shadow is infinitesimally low on real hardware, but significantly larger in a virtual environment, e.g. if the vCPU is preempted in the STI shadow. For GIF=0, the argument isn't as clear cut, because the window where two NMIs can collide is much larger in bare metal (though still small). That said, KVM should not have divergent behavior for the GIF=0 case based on whether or not vNMI support is enabled. And KVM has allowed simultaneous NMIs with GIF=0 for over a decade, since commit 7460fb4a3400 ("KVM: Fix simultaneous NMIs"). I.e. KVM's GIF=0 handling shouldn't be modified without a *really* good reason to do so, and if KVM's behavior were to be modified, it should be done irrespective of vNMI support. Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI") Cc: stable@vger.kernel.org Cc: Santosh Shukla <Santosh.Shukla@amd.com> Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240522021435.1684366-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86: Force KVM_WERROR if the global WERROR is enabledSean Christopherson
Force KVM_WERROR if the global WERROR is enabled to avoid pestering the user about a Kconfig that will ultimately be ignored. Force KVM_WERROR instead of making it mutually exclusive with WERROR to avoid generating a .config builds KVM with -Werror, but has KVM_WERROR=n. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240517180341.974251-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86: Disable KVM_INTEL_PROVE_VE by defaultSean Christopherson
Disable KVM's "prove #VE" support by default, as it provides no functional value, and even its sanity checking benefits are relatively limited. I.e. it should be fully opt-in even on debug kernels, especially since EPT Violation #VE suppression appears to be buggy on some CPUs. Opportunistically add a line in the help text to make it abundantly clear that KVM_INTEL_PROVE_VE should never be enabled in a production environment. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Enumerate EPT Violation #VE support in /proc/cpuinfoSean Christopherson
Don't suppress printing EPT_VIOLATION_VE in /proc/cpuinfo, knowing whether or not KVM_INTEL_PROVE_VE actually does anything is extremely valuable. A privileged user can get at the information by reading the raw MSR, but the whole point of the VMX flags is to avoid needing to glean information from raw MSR reads. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86/mmu: Print SPTEs on unexpected #VESean Christopherson
Print the SPTEs that correspond to the faulting GPA on an unexpected EPT Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE didn't have SUPPRESS_VE set. Opportunistically assert that the underlying exit reason was indeed an EPT Violation, as the CPU has *really* gone off the rails if a #VE occurs due to a completely unexpected exit reason. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Dump VMCS on unexpected #VESean Christopherson
Dump the VMCS on an unexpected #VE, otherwise it's practically impossible to figure out why the #VE occurred. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86/mmu: Add sanity checks that KVM doesn't create EPT #VE SPTEsSean Christopherson
Assert that KVM doesn't set a SPTE to a value that could trigger an EPT Violation #VE on a non-MMIO SPTE, e.g. to help detect bugs even without KVM_INTEL_PROVE_VE enabled, and to help debug actual #VE failures. Note, this will run afoul of TDX support, which needs to reflect emulated MMIO accesses into the guest as #VEs (which was the whole point of adding EPT Violation #VE support in KVM). The obvious fix for that is to exempt MMIO SPTEs, but that's annoyingly difficult now that is_mmio_spte() relies on a per-VM value. However, resolving that conundrum is a future problem, whereas getting KVM_INTEL_PROVE_VE healthy is a current problem. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: nVMX: Always handle #VEs in L0 (never forward #VEs from L2 to L1)Sean Christopherson
Always handle #VEs, e.g. due to prove EPT Violation #VE failures, in L0, as KVM does not expose any #VE capabilities to L1, i.e. any and all #VEs are KVM's responsibility. Fixes: 8131cf5b4fd8 ("KVM: VMX: Introduce test mode related to EPT violation VE") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: nVMX: Initialize #VE info page for vmcs02 when proving #VE supportSean Christopherson
Point vmcs02.VE_INFORMATION_ADDRESS at the vCPU's #VE info page when initializing vmcs02, otherwise KVM will run L2 with EPT Violation #VE enabled and a VE info address pointing at pfn 0. Fixes: 8131cf5b4fd8 ("KVM: VMX: Introduce test mode related to EPT violation VE") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Don't kill the VM on an unexpected #VESean Christopherson
Don't terminate the VM on an unexpected #VE, as it's extremely unlikely the #VE is fatal to the guest, and even less likely that it presents a danger to the host. Simply resume the guest on "failure", as the #VE info page's BUSY field will prevent converting any more EPT Violations to #VEs for the vCPU (at least, that's what the BUSY field is supposed to do). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: x86/mmu: Use SHADOW_NONPRESENT_VALUE for atomic zap in TDP MMUIsaku Yamahata
Use SHADOW_NONPRESENT_VALUE when zapping TDP MMU SPTEs with mmu_lock held for read, tdp_mmu_zap_spte_atomic() was simply missed during the initial development. Fixes: 7f01cab84928 ("KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE") Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [sean: write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240518000430.1118488-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23drm/xe: Enable D3Cold on 'low' VRAM utilizationRodrigo Vivi
Now that we eliminated all the mem_access get/put with its locking issues from the inner calls of migration, we can allow D3Cold. Enable it when VRAM utilization is lower then 300Mb. On higher utilization we only allow D3hot so we don't increase so much the latency on runtime resume due to the memory restoration. Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Anshuman Gupta <anshuman.gupta@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522170105.327472-7-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-05-23drm/xe: Stop checking for power_lost on D3ColdRodrigo Vivi
GuC reset status is not reliable for this purpose and it is once in a while ending up in a situation of D3Cold, where power_reset is false and without the proper memory restoration the GuC reload and Display will fail to come back from D3Cold. So, let's do a full restoration of everything if we have a risk of losing power, without further optimizations. v2: also remove the gut_in_reset function (Anshuman) Cc: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Anshuman Gupta <anshuman.gupta@intel.com> Reviewed-by: Badal Nilawar <badal.nilawar@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522170105.327472-6-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-05-23drm/xe: Prepare display for D3ColdRodrigo Vivi
Prepare power-well and DC handling for a full power lost during D3Cold, then sanitize it upon D3->D0. Otherwise we get a bunch of state mismatch. Ideally we could leave DC9 enabled and wouldn't need to move DC9->DC0 on every runtime resume, however, the disable_DC is part of the power-well checks and intrinsic to the dc_off power well. In the future that can be detangled so we can have even bigger power savings. But for now, let's focus on getting a D3Cold, which saves much more power by itself. v2: create new functions to avoid full-suspend-resume path, which would result in a deadlock between xe_gem_fault and the modeset-ioctl. v3: Only avoid the full modeset to avoid the race, for a more robust suspend-resume. Cc: Anshuman Gupta <anshuman.gupta@intel.com> Cc: Uma Shankar <uma.shankar@intel.com> Tested-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Anshuman Gupta <anshuman.gupta@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522170105.327472-5-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-05-23drm/xe: Relax runtime pm protection around VMRodrigo Vivi
In the regular use case scenario, user space will create a VM, and keep it alive for the entire duration of its workload. For the regular desktop cases, it means that the VM is alive even on idle scenarios where display goes off. This is unacceptable since this would entirely block runtime PM indefinitely, blocking deeper Package-C state. This would be a waste drainage of power. Limit the VM protection solely for long-running workloads that are not protected by the scheduler references. By design, run_job for long-running workloads returns NULL and the scheduler drops all the references of it, hence protecting the VM for this case is necessary. v2: Update commit message to a more imperative language and to reflect why the VM protection is really needed. Also add a comment in the code to let the reason visbible. v3: Remove vma_access case and the mentions to mmap. Mmap cases are already protected by the gem page fault. Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Tested-by: Francois Dugast <francois.dugast@intel.com> Reviewed-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240522170105.327472-4-rodrigo.vivi@intel.com Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>