linux-arm.git - Russell King's ARM Linux kernel tree

Age	Commit message (Collapse)	Author
2024-06-27	drm/amdgpu: process RAS fatal error MB notification	Vignesh Chander
	For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset completion message first to avoid accidentally spamming multiple reset requests to host. v2: add another mailbox check for handling case where kfd detects timeout first v3: set host_flr bit and use wait_for_reset Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com> Reviewed-by: Zhigang Luo <Zhigang.Luo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-27	drm/amdgpu: Use dev_ prints for virtualization as it supports multi adapter	Vignesh Chander
	So we can get clearer per device logging. Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com> Reviewed-by: Zhigang Luo <Zhigang.Luo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-06-14	drm/amdgpu: fix sriov host flr handler	Yunxiang Li
	We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen. In the current state since we take tens of seconds to stop everything, it is very likely that host would give up waiting and reset the GPU before we send ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also let us tell how slow ready to reset actually is from the host. The ready to reset speed can be improved later. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02	drm/amdgpu: Add reset_context flag for host FLR	Yunxiang Li
	There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the FLR comes from the host does not work. Add a flag in reset_context to explicitly mark host triggered reset, and set this flag when we receive host reset notification. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: Zhigang Luo <zhigang.luo@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02	drm/amdgpu: Fix two reset triggered in a row	Yunxiang Li
	Some times a hang GPU causes multiple reset sources to schedule resets. The second source will be able to trigger an unnecessary reset if they schedule after we call amdgpu_device_stop_pending_resets. Move amdgpu_device_stop_pending_resets to after the reset is done. Since at this point the GPU is supposedly in a good state, any reset scheduled after this point would be a legitimate reset. Remove unnecessary and incorrect checks for amdgpu_in_reset that was kinda serving this purpose. Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20	drm/amdgpu: trigger flr_work if reading pf2vf data failed	Zhigang Luo
	if reading pf2vf data failed 30 times continuously, it means something is wrong. Need to trigger flr_work to recover the issue. also use dev_err to print the error message to get which device has issue and add warning message if waiting IDH_FLR_NOTIFICATION_CMPL timeout. Signed-off-by: Zhigang Luo <Zhigang.Luo@amd.com> Acked-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-25	drm/amdgpu: Support passing poison consumption ras block to SRIOV	YiPeng Chai
	Support passing poison consumption ras blocks to SRIOV. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-15	drm/amdgpu: add RAS poison consumption handler for AI SRIOV	Tao Zhou
	Send message to host and host will handle it. v2: split the patch into two parts, one is for mxgpu ai and another one is for common poison consumption handler. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-18	Revert "drm/amdgpu: let mode2 reset fallback to default when failure"	Victor Zhao
	This reverts commit dac6b80818ac2353631c5a33d140d8d5508e2957. This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with the original design of reset handler. Will redesign it. Fixes: dac6b80818ac23 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16	drm/amdgpu: let mode2 reset fallback to default when failure	Victor Zhao
	- introduce AMDGPU_SKIP_MODE2_RESET flag - let mode2 reset fallback to default reset method if failed v2: move this part out from the asic specific part Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-13	drm/amdgpu: support reset flag set for gpu reset	Likun Gao
	Move reset_context out of gpu recover function to make it configurable for different reset purpose. For the reset way of call gpu_recovery sysfs, force to use full reset method. Otherwise, try soft reset by default if the related ASIC supportted, if soft reset failed, will use full reset. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10	drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to ↵	Andrey Grodzovsky
	amdgpu_device_gpu_recover We removed the wrapper that was queueing the recover function into reset domain queue who was using this name. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09	drm/amdgpu: Move in_gpu_reset into reset_domain	Andrey Grodzovsky
	We should have a single instance per entrire reset domain. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74116.html
2022-02-09	drm/amdgpu: Move reset sem into reset_domain	Andrey Grodzovsky
	We want single instance of reset sem across all reset clients because in case of XGMI we should stop access cross device MMIO because any of them could be in a reset in the moment. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74117.html
2022-02-09	drm/amdgpu: Rework reset domain to be refcounted.	Andrey Grodzovsky
	The reset domain contains register access semaphor now and so needs to be present as long as each device in a hive needs it and so it cannot be binded to XGMI hive life cycle. Adress this by making reset domain refcounted and pointed by each member of the hive and the hive itself. v4: Fix crash on boot witrh XGMI hive by adding type to reset_domain. XGMI will only create a new reset_domain if prevoius was of single device type meaning it's first boot. Otherwsie it will take a refocunt to exsiting reset_domain from the amdgou device. Add a wrapper around reset_domain->refcount get/put and a wrapper around send to reset wq (Lijo) Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74121.html
2022-02-09	drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.	Andrey Grodzovsky
	No need to to trigger another work queue inside the work queue. v3: Problem: Extra reset caused by host side FLR notification following guest side triggered reset. Fix: Preven qeuing flr_work from mailbox irq if guest already executing a reset. Suggested-by: Liu Shaoyun <Shaoyun.Liu@amd.com> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Liu Shaoyun <Shaoyun.Liu@amd.com> Link: https://www.spinics.net/lists/amd-gfx/msg74114.html
2022-01-07	drm/amdgpu: add dummy event6 for vega10	James Yao
	[why] Malicious mailbox event1 fails driver loading on vega10. A dummy event6 prevent driver from taking response from malicious event1 as its own. [how] On vega10, send a mailbox event6 before sending event1. Signed-off-by: James Yao <yiqing.yao@amd.com> Reviewed-by: Jingwen Chen <Jingwen.Chen2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-14	drm/amdgpu: SRIOV flr_work should use down_write	Victor Skvortsov
	Host initiated VF FLR may fail if someone else is already holding a read_lock. Change from down_write_trylock to down_write to guarantee the reset goes through. Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com> Reviewed by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-08-30	drm/amd/amdgpu: Add ready_to_reset resp for vega10	YuBiao Wang
	Send response to host after received the flr notification from host. Port NV change to vega10. Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com> Reviewed-by: Jingwen Chen <Jingwen.Chen2@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-07-13	drm/amdgpu: SRIOV flr_work should take write_lock	Jingwen Chen
	[Why] If flr_work takes read_lock, then other threads who takes read_lock can access hardware when host is doing vf flr. [How] flr_work should take write_lock to avoid this case. Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Reviewed-by: Monk Liu <monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-01-13	drm/amdgpu/sriov Stop data exchange for wholegpu reset	Jack Zhang
	[Why] When host trigger a whole gpu reset, guest will keep waiting till host finish reset. But there's a work queue in guest exchanging data between vf&pf which need to access frame buffer. During whole gpu reset, frame buffer is not accessable, and this causes the call trace. [How] After vf get reset notification from pf, stop data exchange. Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com> Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Reviewed-by: Monk Liu <monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-12-15	drm/amdgpu/SRIOV: Extend VF reset request wait period	Jiange Zhao
	In Virtualization case, when one VF is sending too many FLR requests, hypervisor would stop responding to this VF's request for a long period of time. This is called event guard. During this period of cooling time, guest driver should wait instead of doing other things. After this period of time, guest driver would resume reset process and return to normal. Currently, guest driver would wait 12 seconds and return fail if it doesn't get response from host. Solution: extend this waiting time in guest driver and poll response periodically. Poll happens every 6 seconds and it will last for 60 seconds. v2: change the max repetition times from number to macro. Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com> Acked-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-09-15	drm/amdgpu: Do gpu recovery when no job is running	Liu ChengZhe
	In function flr_work, we should do gpu recovery when no job is running. Fix the logic by inverting it. v2: modify the description Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Liu ChengZhe <ChengZhe.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24	drm/amdgpu: change reset lock from mutex to rw_semaphore	Dennis Li
	clients don't need reset-lock for synchronization when no GPU recovery. v2: change to return the return value of down_read_killable. v3: if GPU recovery begin, VF ignore FLR notification. Reviewed-by: Monk Liu <monk.liu@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-24	drm/amdgpu: refine codes to avoid reentering GPU recovery	Dennis Li
	if other threads have holden the reset lock, recovery will fail to try_lock. Therefore we introduce atomic hive->in_reset and adev->in_gpu_reset, to avoid reentering GPU recovery. v2: drop "? true : false" in the definition of amdgpu_in_reset Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-18	drm/amdgpu: Fix repeatly flr issue	jqdeng
	Only for no job running test case need to do recover in flr notification. For having job in mirror list, then let guest driver to hit job timeout, and then do recover. Signed-off-by: jqdeng <Emily.Deng@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-08-14	drm/amdgpu: revert "fix system hang issue during GPU reset"	Christian König
	The whole approach wasn't thought through till the end. We already had a reset lock like this in the past and it caused the same problems like this one. Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary. This reverts commit df9c8d1aa278c435c30a69b8f2418b4a52fcb929. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2020-07-27	drm/amdgpu: fix system hang issue during GPU reset	Dennis Li
	when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover, the atomic adev->in_gpu_reset and hive->in_reset are used to avoid re-entering GPU recovery. During GPU reset and resume, it is unsafe that other threads access GPU, which maybe cause GPU reset failed. Therefore the new rw_semaphore adev->reset_sem is introduced, which protect GPU from being accessed by external threads during recovery. v2: 1. add rwlock for some ioctls, debugfs and file-close function. 2. change to use dqm->is_resetting and dqm_lock for protection in kfd driver. 3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid re-enter GPU recovery for the same GPU hang. v3: 1. change back to use adev->reset_sem to protect kfd callback functions, because dqm_lock couldn't protect all codes, for example: free_mqd must be called outside of dqm_lock; [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019 [ 1230.177221] Call Trace: [ 1230.178249] dump_stack+0x98/0xd5 [ 1230.179443] amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu] [ 1230.180673] gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu] [ 1230.181882] amdgpu_gart_unbind+0xa9/0xe0 [amdgpu] [ 1230.183098] amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu] [ 1230.184239] ? ttm_bo_put+0x171/0x5f0 [ttm] [ 1230.185394] ttm_tt_unbind+0x21/0x40 [ttm] [ 1230.186558] ttm_tt_destroy.part.12+0x12/0x60 [ttm] [ 1230.187707] ttm_tt_destroy+0x13/0x20 [ttm] [ 1230.188832] ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm] [ 1230.189979] ttm_bo_put+0x1be/0x5f0 [ttm] [ 1230.191230] amdgpu_bo_unref+0x1e/0x30 [amdgpu] [ 1230.192522] amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu] [ 1230.193833] free_mqd+0x25/0x40 [amdgpu] [ 1230.195143] destroy_queue_cpsch+0x1a7/0x270 [amdgpu] [ 1230.196475] pqm_destroy_queue+0x105/0x260 [amdgpu] [ 1230.197819] kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu] [ 1230.199154] kfd_ioctl+0x277/0x500 [amdgpu] [ 1230.200458] ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu] [ 1230.201656] ? tomoyo_file_ioctl+0x19/0x20 [ 1230.202831] ksys_ioctl+0x98/0xb0 [ 1230.204004] __x64_sys_ioctl+0x1a/0x20 [ 1230.205174] do_syscall_64+0x5f/0x250 [ 1230.206339] entry_SYSCALL_64_after_hwframe+0x49/0xbe 2. remove try_lock and introduce atomic hive->in_reset, to avoid re-enter GPU recovery. v4: 1. remove an unnecessary whitespace change in kfd_chardev.c 2. remove comment codes in amdgpu_device.c 3. add more detailed comment in commit message 4. define a wrap function amdgpu_in_reset v5: 1. Fix some style issues. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com> Suggested-by: Luben Tukov <luben.tuikov@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-23	drm/amdgpu: use true, false for bool variable in mxgpu_ai.c	zhengbin
	Fixes coccicheck warning: drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c:253:2-20: WARNING: Assignment of 0/1 to bool variable drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c:265:2-20: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-12-11	drm/amd/powerplay: enable pp one vf mode for vega10	Yintian Tao
	Originally, due to the restriction from PSP and SMU, VF has to send message to hypervisor driver to handle powerplay change which is complicated and redundant. Currently, SMU and PSP can support VF to directly handle powerplay change by itself. Therefore, the old code about the handshake between VF and PF to handle powerplay will be removed and VF will use new the registers below to handshake with SMU. mmMP1_SMN_C2PMSG_101: register to handle SMU message mmMP1_SMN_C2PMSG_102: register to handle SMU parameter mmMP1_SMN_C2PMSG_103: register to handle SMU response v2: remove module parameter pp_one_vf v3: fix the parens v4: forbid vf to change smu feature v5: use hwmon_attributes_visible to skip sepicified hwmon atrribute v6: change skip condition at vega10_copy_table_to_smc Signed-off-by: Yintian Tao <yttao@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Kenneth Feng <kenneth.feng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-08-02	drm/amdgpu: cleanup vega10 SRIOV code path	Monk Liu
	we can simplify all those unnecessary function under SRIOV for vega10 since: 1) PSP L1 policy is by force enabled in SRIOV 2) original logic always set all flags which make itself a dummy step besides, 1) the ih_doorbell_range set should also be skipped for VEGA10 SRIOV. 2) the gfx_common registers should also be skipped for VEGA10 SRIOV. Signed-off-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-06-11	drm/amdgpu: Hardcode reg access using L1 security	Trigger Huang
	Under Vega10 SR-IOV VF, L1 register access mode should be enabled by default as the non-security VF will no longer be supported. Signed-off-by: Trigger Huang <Trigger.Huang@amd.com> Reviewed-by: Emily Deng <Emily.Deng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24	drm/amdgpu: init vega10 SR-IOV reg access mode	Trigger Huang
	Set different register access mode according to the features provided by firmware Signed-off-by: Trigger Huang <Trigger.Huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-24	drm/amdgpu: enable separate timeout setting for every ring type V4	Evan Quan
	Every ring type can have its own timeout setting. - V2: update lockup_timeout parameter format and cosmetic fixes - V3: invalidate 0 and negative values - V4: update lockup_timeout parameter format Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-05-06	drm/amdgpu: Add IDH_QUERY_ALIVE event for SR-IOV	Trigger Huang
	SR-IOV host side will send IDH_QUERY_ALIVE to guest VM to check if this guest VM is still alive (not destroyed). The only thing guest KMD need to do is to send ACK back to host. Signed-off-by: Trigger Huang <Trigger.Huang@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-04-10	drm/amdgpu: support dpm level modification under virtualization v3	Yintian Tao
	Under vega10 virtualuzation, smu ip block will not be added. Therefore, we need add pp clk query and force dpm level function at amdgpu_virt_ops to support the feature. v2: add get_pp_clk existence check and use kzalloc to allocate buf v3: return -ENOMEM for allocation failure and correct the coding style Signed-off-by: Yintian Tao <yttao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-02-13	drm/amdgpu: tighten gpu_recover in mailbox_flr to avoid duplicate recover in ↵	wentalou
	sriov sriov's gpu_recover inside xgpu_ai_mailbox_flr_work would cause duplicate recover in TDR. TDR's gpu_recover would be triggered by amdgpu_job_timedout, that could avoid vk-cts failure by unexpected recover. Signed-off-by: Wentao Lou <Wentao.Lou@amd.com> Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2019-01-14	drm/amdgpu/sriov:Correct pfvf exchange logic	Emily Deng
	The pfvf exchange need be in exclusive mode. And add pfvf exchange in gpu reset. Signed-off-by: Emily Deng <Emily.Deng@amd.com> Reviewed-By: Xiangliang Yu <Xiangliang.Yu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-27	drm/amdgpu: cleanup GPU recovery check a bit (v2)	Christian König
	Check if we should call the function instead of providing the forced flag. v2: rebase on KFD changes (Alex) Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-05-15	drm/amdgpu/sriov: Need to set in_gpu_reset flag to back after gpu reset	Emily Deng
	After host os reset gpu reset, need to set flag in_gpu_reset to zero. Signed-off-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: Monk Liu <monk.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-03-22	drm/amdgpu: fix spelling mistake: "asssert" -> "assert"	Colin Ian King
	Trivial fix to spelling mistake in pr_err error message text Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-03-14	drm/amdgpu: Move IH clientid defs to separate file	Oak Zeng
	This is preparation for sharing client ID definitions between amdgpu and amdkfd Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Chunming Zhou <david1.zhou@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-03-14	drm/amdgpu: refactoring mailbox to fix TDR handshake bugs(v2)	Monk Liu
	this patch actually refactor mailbox implmentations, and all below changes are needed together to fix all those mailbox handshake issues exposured by heavey TDR test. 1)refactor all mailbox functions based on byte accessing for mb_control reason is to avoid touching non-related bits when writing trn/rcv part of mailbox_control, this way some incorrect INTR sent to hypervisor side could be avoided, and it fixes couple handshake bug. 2)trans_msg function re-impled: put a invalid logic before transmitting message to make sure the ACK bit is in a clear status, otherwise there is chance that ACK asserted already before transmitting message and lead to fake ACK polling. (hypervisor side have some tricks to workaround ACK bit being corrupted by VF FLR which hase an side effects that may make guest side ACK bit asserted wrongly), and clear TRANS_MSG words after message transferred. 3)for mailbox_flr_work, it is also re-worked: it takes the mutex lock first if invoked, to block gpu recover's participate too early while hypervisor side is doing VF FLR. (hypervisor sends FLR_NOTIFY to guest before doing VF FLR and sentds FLR_COMPLETE after VF FLR done, and the FLR_NOTIFY will trigger interrupt to guest which lead to mailbox_flr_work being invoked) This can avoid the issue that mailbox trans msg being cleared by its VF FLR. 4)for mailbox_rcv_irq IRQ routine, it should only peek msg and schedule mailbox_flr_work, instead of ACK to hypervisor itself, because FLR_NOTIFY msg sent from hypervisor side doesn't need VF's ACK (this is because VF's ACK would lead to hypervisor clear its trans_valid/msg, and this would cause handshake bug if trans_valid/msg is cleared not due to correct VF ACK but from a wrong VF ACK like this "FLR_NOTIFY" one) This fixed handshake bug that sometimes GUEST always couldn't receive "READY_TO_ACCESS_GPU" msg from hypervisor. 5)seperate polling time limite accordingly: POLL ACK cost no more than 500ms POLL MSG cost no more than 12000ms POLL FLR finish cost no more than 500ms 6) we still need to set adev into in_gpu_reset mode after we received FLR_NOTIFY from host side, this can prevent innocent app wrongly succesed to open amdgpu dri device. FLR_NOFITY is received due to an IDLE hang detected from hypervisor side which indicating GPU is already die in this VF. v2: use MACRO as the offset of mailbox_control register don't test if NOTIFY_CMPL event in rcv_msg since it won't recieve that message anymore Signed-off-by: Monk Liu <Monk.Liu@amd.com> Reviewed-by: Pixel Ding <Pixel.Ding@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-18	drm/amdgpu: rename amdgpu_gpu_recover	Alex Deucher
	add device to the name for consistency. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-15	drm/amdgpu: Simplify amdgpu_lockup_timeout usage.	Andrey Grodzovsky
	With introduction of amdgpu_gpu_recovery we don't need any more to rely on amdgpu_lockup_timeout == 0 for disabling GPU reset. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-15	drm/amdgpu: Add gpu_recovery parameter	Andrey Grodzovsky
	Add new parameter to control GPU recovery procedure. v2: Add auto logic where reset is disabled for bare metal and enabled for SR-IOV. Allow forced reset from debugfs. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-08	drm/admgpu: Reduce the usage of soc15ip.h	Shaoyun Liu
	Remove the header where it's not used. Acked-by: Christian Konig <christian.koenig@amd.com> Signed-off-by: Shaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06	drm/amd/include:cleanup vega10 header files.	Feifei Xu
	Remove asic_reg/vega10 folder. Signed-off-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06	drm/amd/include:cleanup vega10 nbio header files.	Feifei Xu
	Cleanup asic_reg/vega10/NBIO folder. Signed-off-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2017-12-06	drm/amd/include:cleanup vega10 gc header files.	Feifei Xu
	Cleanup asic_reg/vega10/GC folder. Signed-off-by: Feifei Xu <Feifei.Xu@amd.com> Signed-off-by: Feifei Xu <Feifei.Xu@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>