drm/amdkfd: fix the incorrect exception handling logic in function amd_acquire() #176

zxpdemonio · 2024-10-31T04:37:33Z

In function amd_acquire(), kfd_get_process() is call to get process. When judge whether we get an exception pointer, we shouldn't judge whether it's a null pointer, because kfd_get_process will return ERR_PTR(-EINVAL) instead of null pointer if error.

Because of this wrong logic, the kernel will panic then once kfd_get_process() returns ERR_PTR(-EINVAL).

So, the correct logic should be:
if (IS_ERR(p)) {

Fixes: commit 779b4d0("drm/amdkfd: Add RDMA and PeerDirect support")
fixes: #175

The parameters segment_width and last_segment_width are used to control the configuration of the Output Plane Processor (OPP), specifically the width of each segment that the display is divided into and the width of the last segment Fixes the below with gcc W=1: drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'segment_width' not described in 'optc35_set_odm_combine' drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'last_segment_width' not described in 'optc35_set_odm_combine' drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Excess function parameter 'timing' description in 'optc35_set_odm_combine' Cc: Tom Chung <[email protected]> Cc: Rodrigo Siqueira <[email protected]> Cc: Roman Li <[email protected]> Cc: Aurabindo Pillai <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Tom Chung <[email protected]>

Align with new port same as smu 13.x. Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Jack Gui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

Update the capabilities for supporting 8k encoding. Reviewed-by: David (Ming Qiang) Wu <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Ruijing Dong <[email protected]>

The following commit updated gmc->noretry from 0 to 1 for GC HW IP 9.3.0: commit 5f3854f ("drm/amdgpu: add more cases to noretry=1") This causes the device to hang when a page fault occurs, until the device is rebooted. Instead, revert back to gmc->noretry=0 so the device is still responsive. Fixes: 5f3854f ("drm/amdgpu: add more cases to noretry=1") Signed-off-by: Tim Van Patten <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

./drivers/gpu/drm/amd/amdgpu/amdgpu.h: amdgpu_umsch_mm.h is included more than once. Reported-by: Abaci Robot <[email protected]> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9063 Signed-off-by: Jiapeng Chong <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

…ing_set_wptr This commit removes a duplicate check for *is_queue_unmap in the sdma_v7_0_ring_set_wptr function. The check at line 171 was considered dead code because at this point in the code, we already know that *is_queue_unmap is false due to the check at line 161. By removing this unnecessary check, improves the readability of the code Fixes the below: drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c:171 sdma_v7_0_ring_set_wptr() warn: duplicate check '*is_queue_unmap' (previous on line 161) drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c 140 static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring) 141 { 142 struct amdgpu_device *adev = ring->adev; 143 uint32_t *wptr_saved; 144 uint32_t *is_queue_unmap; 145 uint64_t aggregated_db_index; 146 uint32_t mqd_size = adev->mqds[AMDGPU_HW_IP_DMA].mqd_size; 147 148 DRM_DEBUG("Setting write pointer\n"); 149 150 if (ring->is_mes_queue) { 151 wptr_saved = (uint32_t *)(ring->mqd_ptr + mqd_size); 152 is_queue_unmap = (uint32_t *)(ring->mqd_ptr + mqd_size + ^^^^^^^^^^^^^^^^ Set here 153 sizeof(uint32_t)); 154 aggregated_db_index = 155 amdgpu_mes_get_aggregated_doorbell_index(adev, 156 ring->hw_prio); 157 158 atomic64_set((atomic64_t *)ring->wptr_cpu_addr, 159 ring->wptr << 2); 160 *wptr_saved = ring->wptr << 2; 161 if (*is_queue_unmap) { ^^^^^^^^^^^^^^^ Checked here 162 WDOORBELL64(aggregated_db_index, ring->wptr << 2); 163 DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n", 164 ring->doorbell_index, ring->wptr << 2); 165 WDOORBELL64(ring->doorbell_index, ring->wptr << 2); 166 } else { 167 DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n", 168 ring->doorbell_index, ring->wptr << 2); 169 WDOORBELL64(ring->doorbell_index, ring->wptr << 2); 170 --> 171 if (*is_queue_unmap) ^^^^^^^^^^^^^^^ This is dead code. We know it's false. 172 WDOORBELL64(aggregated_db_index, 173 ring->wptr << 2); 174 } 175 } else { 176 if (ring->use_doorbell) { 177 DRM_DEBUG("Using doorbell -- " 178 "wptr_offs == 0x%08x " Fixes: 6d9c711786e6 ("drm/amdgpu: Add sdma v7_0 ip block support (v7)") Cc: Likun Gao <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Likun Gao <[email protected]> Reviewed-by: Asad Kamal <[email protected]>

modify the lock type to 'spinlock' to avoid schedule issue in interrupt context. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

Add support to set/get information about different DPM policies. The support is only available on SOCs which use swsmu architecture. A DPM policy type may be defined with different levels. For example, a policy may be defined to select Pstate preference and then later a pstate preference may be chosen. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Asad Kamal <[email protected]>

It's caused by 7535d371a27cc788d8f41394a22e3fd492ee5d2c "drm/amd/pm: Add support for DPM policies" Signed-off-by: Bob Zhou <[email protected]> Reviewed-by: Asher Song <[email protected]>

Add PMF message to select a Pstate policy in SOCs with SMU v13.0.6. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>

Add support to select pstate policy in SOCs with SMUv13.0.6 Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>

Add support to set XGMI PLPD policy levels through 'pm_policy/xgmi_plpd' sysfs node. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Asad Kamal <[email protected]>

On SOCs with SMU v13.0.6, allow changing xgmi plpd policy through 'pm_policy/xgmi_plpd' sysfs interface. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>

On aldebaran, allow changing xgmi plpd policy through 'pm_policy/xgmi_plpd' sysfs interface. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>

modify the lock type to 'spinlock' to avoid schedule issue in interrupt context. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

On arcturus, allow changing xgmi plpd policy through 'pm_policy/xgmi_plpd' sysfs interface. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>

Replace the legacy interface with amdgpu_dpm_set_pm_policy to set XGMI PLPD mode. Also, xgmi_plpd_policy sysfs node is not used by any client. Remove that as well. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>

fix ACA no query result after gpu reset. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

Move dsc functions from dc.c to dc_dsc.c. Co-Developed-by: George Shen <[email protected]> Signed-off-by: Wenjing Liu <[email protected]> Reviewed-by: Rodrigo Siqueira <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

The function that commits planes calls the same set of functions twice, and in the case of the FAMs utilization, it is not desired to call the dmub, hwss_build and hwss_execute. This commit just removes the unnecessary calls to those functions. Acked-by: Roman Li <[email protected]> Signed-off-by: Rodrigo Siqueira <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

…or dcn401 The functions are missing. These two functions are required to support MST. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Wenjing Liu <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

The comparisons intend to be DCN401 inclusive, and fix it by adding equal signs. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Alex Hung <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

[Why] DVI is TMDS signal like HDMI but without audio. Current signal check does not correctly reflect DVI clock programming. [How] Define a new signal check for TMDS that includes DVI to HDMI TMDS programming. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Chris Park <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

This fixes indentations and adjust spaces for better readability and code styles. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Alex Hung <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

[Why] Enable adaptive scaler support for DCN401 [How] - Enable build flag for SPL - Set prefer_easf flag to true - Apply light linear scaling policy based on transfer function and pixel format. Choose between linear or non-linear scaling - Set matrix_mode based on pixel format - Disable ring estimator - Add missing EASF register defines, masks, and writes - Disable EASF if scale ratio or number of taps is unsupported and when bypassing the scaler - Add debug flags and registry keys for debugging SPL and EASF - Add support for Visual Confirm with EASF Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Samson Tam <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

[why] Cleaning up the code refactor requires hubbub to be in its own component. [how] Move all DCN401 files under newly created hubbub folder and fixing the makefiles. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Harikrishna Revalla <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

Need to select DTBCLK and DPREFCLK as DTBCLK_p source according to hardware guidance. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Dillon Varone <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

Acked-by: Roman Li <[email protected]> Signed-off-by: Rodrigo Siqueira <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

update driver-if interface for smu 14.0.2/3 Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

[Why] Coverity reports NULL_RETURN warning. [How] Add pointer NULL check. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Hersen Wu <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

Add debugfs nodes for enabling/disabling and tuning parameters used in phase detect. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

Add support for enabling phase detect and tuning params for SMUv13.0.6 SOCs. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

Phase detect controls are only available for SMUv13.0.6 dGPUs. Create control object only on those. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

v3.x changed the how vram width was encoded. The previous implementation actually worked correctly for most boards. Fix the implementation to work correctly everywhere. This fixes the vram width reported in the kernel log on some boards. Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

Limiting the allocation of higher order pages to the closest NUMA node and enabling direct memory reclaim provides not only failsafe against situations when memory becomes too much fragmented and the allocator is not able to satisfy the request from the local node but falls back to remote pages (HUGEPAGE) but also offers performance improvement. Accessing remote pages suffers due to bandwidth limitations and could be avoided if memory becomes defragmented and in most cases without using manual compaction. (/proc/sys/vm/compact_memory) Note: On certain distros such as RHEL, the proactive compaction is disabled. (https://tinyurl.com/4f32f7rs) v2 (chk): drop __GFP_RECLAIM since that is already set by GFP_USER Cc: Dave Airlie <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Daniel Vetter <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Christian König <[email protected]> (cherry picked from commit 73534de04176394f72d0ca0e7990bf03addcf5ea)

Add support to get phase detect residency through debugfs Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

Add support to get phase detect residency information on SMUv13.0.6 SOCs. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

Sysfs node disable query error count during gpu reset. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Stanley.Yang <[email protected]>

Cache the PCI state before bus master is disabled. The saved state is later used for other cases like restoring config space after mode-2 reset. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Fixes: 5c03e58 ("drm/amdgpu:add smu mode1/2 support for aldebaran")

To align with firmware, hbm id field 0x1 refers to hbm stack 0, 0x2 refers to hbm statck 1. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

Data abort exception and unknown errors are supported. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

In the convenience of calling it globally. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

Return RMA status without message print. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

Instead of printing GPU reset failed. v2: add check for reset_context->src. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

KIQ timeouts no longer seen. This reverts commit b4d12cc00ad69e8a0dea2ece7202bacfd8b894fb. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]>

VFs do not perform HW fini/suspend in FLR, so the dpm_enabled is incorrectly kept enabled. Add interface to disable it in virt_pre_reset call. v2: Made implementation generic for all asics v3: Re-order conditionals so PP_MP1_STATE_FLR is only evaluated on VF Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Lijo Lazar <[email protected]>

Register access from userspace should be blocked until reset is complete. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Alex Deucher <[email protected]>

Stop waiting for the KIQ to return back when there is a reset pending. It's quite likely that the KIQ will never response. Signed-off-by: Koenig Christian <[email protected]> Suggested-by: Lazar Lijo <[email protected]> Tested-by: Victor Skvortsov <[email protected]> Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

[background] when unloading amdgpu driver right after running a workload, drain_workqueue is causing "Fence fallback timer expired on ring sdma0.0". Under sriov, this issue will cause sriov full access timeout and a reset happening. move drain_workqueue before shutdown is set to allow ih process and before enter full access under sriov to avoid full access time cost. Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Feifei Xu <[email protected]>

Driver mode-2 is only supported by relative new smc firmware. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

This was only used as workaround for recovering the page tables after VRAM was lost and is no longer necessary after the function amdgpu_vm_bo_reset_state_machine() started to do the same. Compute never used shadows either, so the only proplematic case left is SVM and that is most likely not recoverable in any way when VRAM is lost. Signed-off-by: Christian König <[email protected]> Acked-by: Lijo Lazar <[email protected]>

Query pmfw version to determine if aux sos fw needs to be loaded in psp v13.0. v2: refine callback to check if aux_fw loading is needed instead of getting pmfw version barely v3: return the comparison directly Signed-off-by: Le Ma <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

To be compatible with legacy IFWI, driver needs to carry legacy tOS and query pmfw version to load them accordingly. Add psp_firmware_header_v2_1 to handle the combined sos binary. Double the sos count limit for the case of aux sos fw packed. v2: pass the correct fw_bin_desc to parse_sos_bin_descriptor Signed-off-by: Le Ma <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

…ments It's caused by v6.9-rc6-1554-g8a0a7b98d4b6 drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2 Signed-off-by: Bob Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

In multiple GPUs case, after a GPU has started resetting all GPUs on hive, other GPUs do not need to trigger GPU reset again. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

…d_acquire() In function amd_acquire(), kfd_get_process() is call to get process. When judge whether we get an exception pointer, we shouldn't judge whether it's a null pointer, because kfd_get_process will return ERR_PTR(-EINVAL) instead of null pointer if error. Because of this wrong logic, the kernel will panic then once kfd_get_process() returns ERR_PTR(-EINVAL). So, the correct logic should be: if (IS_ERR(p)) { Fixes: commit 779b4d0("drm/amdkfd: Add RDMA and PeerDirect support") Signed-off-by: Cruz Zhao <[email protected]>

hkasivis · 2024-10-31T13:55:08Z

Looks good. We will merge it soon.

whchung · 2024-11-19T10:43:27Z

Is the patch submitted to upstream? I couldn't find it at:
https://lists.freedesktop.org/archives/amd-gfx/2024-October/
https://lists.freedesktop.org/archives/amd-gfx/2024-November/

kentrussell · 2024-11-19T14:31:31Z

PeerDirect isn't upstreamable, so it cannot go to amd-staging-drm-next (which is what the amd-gfx mailing list is for). You'll notice that kfd_peerdirect.c isn't even present in the amd-staging-drm-next branch. However, I can confirm that this patch was picked internally, and will be in the upcoming ROCm release

srishanm and others added 30 commits May 20, 2024 09:56

drm/amd/pm: support mode1 reset on smu v14.0.3

0831390

Align with new port same as smu 13.x. Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Jack Gui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

drm/amdgpu/vcn: update vcn5 enc/dec capabilities

e39fa4f

Update the capabilities for supporting 8k encoding. Reviewed-by: David (Ming Qiang) Wu <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Ruijing Dong <[email protected]>

drm/amdgpu: change aca bank error lock type to spinlock

0f987f1

modify the lock type to 'spinlock' to avoid schedule issue in interrupt context. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

drm/amdkcl: wrap code under macro HAVE_PCI_DRIVER_DEV_GROUPS

73f33f5

It's caused by 7535d371a27cc788d8f41394a22e3fd492ee5d2c "drm/amd/pm: Add support for DPM policies" Signed-off-by: Bob Zhou <[email protected]> Reviewed-by: Asher Song <[email protected]>

drm/amd/pm: Add xgmi plpd policy to pm_policy

3d4acf8

Add support to set XGMI PLPD policy levels through 'pm_policy/xgmi_plpd' sysfs node. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Asad Kamal <[email protected]>

drm/amdgpu: change bank cache lock type to spinlock

49950b2

modify the lock type to 'spinlock' to avoid schedule issue in interrupt context. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

drm/amdgpu: fix ACA no query result after gpu reset

ad22597

fix ACA no query result after gpu reset. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

drm/amd/display: Fix incorrect DCN401 comparison

04d6d07

The comparisons intend to be DCN401 inclusive, and fix it by adding equal signs. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Alex Hung <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

drm/amd/display: Adjust incorrect indentations and spaces

9cfae3b

This fixes indentations and adjust spaces for better readability and code styles. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Alex Hung <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

drm/amd/display: Add missing DML2 var helpers

0d3ae04

Acked-by: Roman Li <[email protected]> Signed-off-by: Rodrigo Siqueira <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

drm/amd/pm: update driver-if interface

98ad998

update driver-if interface for smu 14.0.2/3 Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Add NULL check within get_target_mpc_factor

5efb72d

[Why] Coverity reports NULL_RETURN warning. [How] Add pointer NULL check. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Hersen Wu <[email protected]> Tested-by: Daniel Wheeler <[email protected]>

Lijo Lazar and others added 26 commits July 12, 2024 14:37

drm/amd/pm: Add debugfs controls for phase detect

f0f6439

Add debugfs nodes for enabling/disabling and tuning parameters used in phase detect. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

drm/amd/pm: Add phase detect support to SMUv13.0.6

74fda16

Add support for enabling phase detect and tuning params for SMUv13.0.6 SOCs. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

drm/amd/pm: Restrict phase detect to SMUv13.0.6

2889fc7

Phase detect controls are only available for SMUv13.0.6 dGPUs. Create control object only on those. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

drm/amd/pm: Add phase detect residency support

3431755

Add support to get phase detect residency through debugfs Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

drm/amd/pm: Add SMUv13.0.6 phase detect residency

06b02f1

Add support to get phase detect residency information on SMUv13.0.6 SOCs. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

drm/amdgpu: sysfs node disable query error count during gpu reset

41f5481

Sysfs node disable query error count during gpu reset. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Stanley.Yang <[email protected]>

drm/amdgpu: Fix hbm stack id in boot error report

a7ae922

To align with firmware, hbm id field 0x1 refers to hbm stack 0, 0x2 refers to hbm statck 1. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

drm/amdgpu: Add more types for boot time error reporting

2a5b596

Data abort exception and unknown errors are supported. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

drm/amdgpu: create function to check RAS RMA status

b406565

In the convenience of calling it globally. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

drm/amdgpu: update bad state check in GPU recovery

67b85cc

Return RMA status without message print. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

drm/amdgpu: report bad status in GPU recovery

d3956fa

Instead of printing GPU reset failed. v2: add check for reset_context->src. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

Revert "drm/amdgpu: Extend KIQ reg polling wait for VF"

c260cf1

KIQ timeouts no longer seen. This reverts commit b4d12cc00ad69e8a0dea2ece7202bacfd8b894fb. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]>

drm/amdgpu: Block MMR_READ IOCTL in reset

703e9b7

Register access from userspace should be blocked until reset is complete. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Alex Deucher <[email protected]>

drm/amdkfd: Select reset method for poison handling

8452061

Driver mode-2 is only supported by relative new smc firmware. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>

drm/amdkcl: check drm_dp_add_payload_part2 whether requires two argun…

2f767b9

…ments It's caused by v6.9-rc6-1554-g8a0a7b98d4b6 drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2 Signed-off-by: Bob Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

amd/amdgpu: Reduce unnecessary repetitive GPU resets

8809b4b

In multiple GPUs case, after a GPU has started resetting all GPUs on hive, other GPUs do not need to trigger GPU reset again. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>

kentrussell force-pushed the master branch from 8809b4b to 627f890 Compare December 3, 2024 19:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

drm/amdkfd: fix the incorrect exception handling logic in function amd_acquire() #176

drm/amdkfd: fix the incorrect exception handling logic in function amd_acquire() #176

zxpdemonio commented Oct 31, 2024 •

edited

Loading

hkasivis commented Oct 31, 2024

whchung commented Nov 19, 2024

kentrussell commented Nov 19, 2024

drm/amdkfd: fix the incorrect exception handling logic in function amd_acquire() #176

Are you sure you want to change the base?

drm/amdkfd: fix the incorrect exception handling logic in function amd_acquire() #176

Conversation

zxpdemonio commented Oct 31, 2024 • edited Loading

hkasivis commented Oct 31, 2024

whchung commented Nov 19, 2024

kentrussell commented Nov 19, 2024

zxpdemonio commented Oct 31, 2024 •

edited

Loading