-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
drm/amdkfd: fix the incorrect exception handling logic in function amd_acquire() #176
base: master
Are you sure you want to change the base?
Conversation
The parameters segment_width and last_segment_width are used to control the configuration of the Output Plane Processor (OPP), specifically the width of each segment that the display is divided into and the width of the last segment Fixes the below with gcc W=1: drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'segment_width' not described in 'optc35_set_odm_combine' drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'last_segment_width' not described in 'optc35_set_odm_combine' drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Excess function parameter 'timing' description in 'optc35_set_odm_combine' Cc: Tom Chung <[email protected]> Cc: Rodrigo Siqueira <[email protected]> Cc: Roman Li <[email protected]> Cc: Aurabindo Pillai <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Tom Chung <[email protected]>
Align with new port same as smu 13.x. Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Jack Gui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
Update the capabilities for supporting 8k encoding. Reviewed-by: David (Ming Qiang) Wu <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Ruijing Dong <[email protected]>
The following commit updated gmc->noretry from 0 to 1 for GC HW IP 9.3.0: commit 5f3854f ("drm/amdgpu: add more cases to noretry=1") This causes the device to hang when a page fault occurs, until the device is rebooted. Instead, revert back to gmc->noretry=0 so the device is still responsive. Fixes: 5f3854f ("drm/amdgpu: add more cases to noretry=1") Signed-off-by: Tim Van Patten <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
./drivers/gpu/drm/amd/amdgpu/amdgpu.h: amdgpu_umsch_mm.h is included more than once. Reported-by: Abaci Robot <[email protected]> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9063 Signed-off-by: Jiapeng Chong <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
…ing_set_wptr This commit removes a duplicate check for *is_queue_unmap in the sdma_v7_0_ring_set_wptr function. The check at line 171 was considered dead code because at this point in the code, we already know that *is_queue_unmap is false due to the check at line 161. By removing this unnecessary check, improves the readability of the code Fixes the below: drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c:171 sdma_v7_0_ring_set_wptr() warn: duplicate check '*is_queue_unmap' (previous on line 161) drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c 140 static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring) 141 { 142 struct amdgpu_device *adev = ring->adev; 143 uint32_t *wptr_saved; 144 uint32_t *is_queue_unmap; 145 uint64_t aggregated_db_index; 146 uint32_t mqd_size = adev->mqds[AMDGPU_HW_IP_DMA].mqd_size; 147 148 DRM_DEBUG("Setting write pointer\n"); 149 150 if (ring->is_mes_queue) { 151 wptr_saved = (uint32_t *)(ring->mqd_ptr + mqd_size); 152 is_queue_unmap = (uint32_t *)(ring->mqd_ptr + mqd_size + ^^^^^^^^^^^^^^^^ Set here 153 sizeof(uint32_t)); 154 aggregated_db_index = 155 amdgpu_mes_get_aggregated_doorbell_index(adev, 156 ring->hw_prio); 157 158 atomic64_set((atomic64_t *)ring->wptr_cpu_addr, 159 ring->wptr << 2); 160 *wptr_saved = ring->wptr << 2; 161 if (*is_queue_unmap) { ^^^^^^^^^^^^^^^ Checked here 162 WDOORBELL64(aggregated_db_index, ring->wptr << 2); 163 DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n", 164 ring->doorbell_index, ring->wptr << 2); 165 WDOORBELL64(ring->doorbell_index, ring->wptr << 2); 166 } else { 167 DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n", 168 ring->doorbell_index, ring->wptr << 2); 169 WDOORBELL64(ring->doorbell_index, ring->wptr << 2); 170 --> 171 if (*is_queue_unmap) ^^^^^^^^^^^^^^^ This is dead code. We know it's false. 172 WDOORBELL64(aggregated_db_index, 173 ring->wptr << 2); 174 } 175 } else { 176 if (ring->use_doorbell) { 177 DRM_DEBUG("Using doorbell -- " 178 "wptr_offs == 0x%08x " Fixes: 6d9c711786e6 ("drm/amdgpu: Add sdma v7_0 ip block support (v7)") Cc: Likun Gao <[email protected]> Cc: Hawking Zhang <[email protected]> Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Likun Gao <[email protected]> Reviewed-by: Asad Kamal <[email protected]>
modify the lock type to 'spinlock' to avoid schedule issue in interrupt context. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>
Add support to set/get information about different DPM policies. The support is only available on SOCs which use swsmu architecture. A DPM policy type may be defined with different levels. For example, a policy may be defined to select Pstate preference and then later a pstate preference may be chosen. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Asad Kamal <[email protected]>
It's caused by 7535d371a27cc788d8f41394a22e3fd492ee5d2c "drm/amd/pm: Add support for DPM policies" Signed-off-by: Bob Zhou <[email protected]> Reviewed-by: Asher Song <[email protected]>
Add PMF message to select a Pstate policy in SOCs with SMU v13.0.6. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>
Add support to select pstate policy in SOCs with SMUv13.0.6 Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>
Add support to set XGMI PLPD policy levels through 'pm_policy/xgmi_plpd' sysfs node. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Asad Kamal <[email protected]>
On SOCs with SMU v13.0.6, allow changing xgmi plpd policy through 'pm_policy/xgmi_plpd' sysfs interface. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>
On aldebaran, allow changing xgmi plpd policy through 'pm_policy/xgmi_plpd' sysfs interface. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>
modify the lock type to 'spinlock' to avoid schedule issue in interrupt context. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>
On arcturus, allow changing xgmi plpd policy through 'pm_policy/xgmi_plpd' sysfs interface. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>
Replace the legacy interface with amdgpu_dpm_set_pm_policy to set XGMI PLPD mode. Also, xgmi_plpd_policy sysfs node is not used by any client. Remove that as well. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Reviewed-by: Asad Kamal <[email protected]> Acked-by: Alex Deucher <[email protected]>
fix ACA no query result after gpu reset. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>
Move dsc functions from dc.c to dc_dsc.c. Co-Developed-by: George Shen <[email protected]> Signed-off-by: Wenjing Liu <[email protected]> Reviewed-by: Rodrigo Siqueira <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
The function that commits planes calls the same set of functions twice, and in the case of the FAMs utilization, it is not desired to call the dmub, hwss_build and hwss_execute. This commit just removes the unnecessary calls to those functions. Acked-by: Roman Li <[email protected]> Signed-off-by: Rodrigo Siqueira <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
…or dcn401 The functions are missing. These two functions are required to support MST. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Wenjing Liu <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
The comparisons intend to be DCN401 inclusive, and fix it by adding equal signs. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Alex Hung <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
[Why] DVI is TMDS signal like HDMI but without audio. Current signal check does not correctly reflect DVI clock programming. [How] Define a new signal check for TMDS that includes DVI to HDMI TMDS programming. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Chris Park <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
This fixes indentations and adjust spaces for better readability and code styles. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Alex Hung <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
[Why] Enable adaptive scaler support for DCN401 [How] - Enable build flag for SPL - Set prefer_easf flag to true - Apply light linear scaling policy based on transfer function and pixel format. Choose between linear or non-linear scaling - Set matrix_mode based on pixel format - Disable ring estimator - Add missing EASF register defines, masks, and writes - Disable EASF if scale ratio or number of taps is unsupported and when bypassing the scaler - Add debug flags and registry keys for debugging SPL and EASF - Add support for Visual Confirm with EASF Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Samson Tam <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
[why] Cleaning up the code refactor requires hubbub to be in its own component. [how] Move all DCN401 files under newly created hubbub folder and fixing the makefiles. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Harikrishna Revalla <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
Need to select DTBCLK and DPREFCLK as DTBCLK_p source according to hardware guidance. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Dillon Varone <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
Acked-by: Roman Li <[email protected]> Signed-off-by: Rodrigo Siqueira <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
update driver-if interface for smu 14.0.2/3 Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Yang Wang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
[Why] Coverity reports NULL_RETURN warning. [How] Add pointer NULL check. Reviewed-by: Rodrigo Siqueira <[email protected]> Signed-off-by: Hersen Wu <[email protected]> Tested-by: Daniel Wheeler <[email protected]>
Add debugfs nodes for enabling/disabling and tuning parameters used in phase detect. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Add support for enabling phase detect and tuning params for SMUv13.0.6 SOCs. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Phase detect controls are only available for SMUv13.0.6 dGPUs. Create control object only on those. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
v3.x changed the how vram width was encoded. The previous implementation actually worked correctly for most boards. Fix the implementation to work correctly everywhere. This fixes the vram width reported in the kernel log on some boards. Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
Limiting the allocation of higher order pages to the closest NUMA node and enabling direct memory reclaim provides not only failsafe against situations when memory becomes too much fragmented and the allocator is not able to satisfy the request from the local node but falls back to remote pages (HUGEPAGE) but also offers performance improvement. Accessing remote pages suffers due to bandwidth limitations and could be avoided if memory becomes defragmented and in most cases without using manual compaction. (/proc/sys/vm/compact_memory) Note: On certain distros such as RHEL, the proactive compaction is disabled. (https://tinyurl.com/4f32f7rs) v2 (chk): drop __GFP_RECLAIM since that is already set by GFP_USER Cc: Dave Airlie <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Daniel Vetter <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Rajneesh Bhardwaj <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Christian König <[email protected]> (cherry picked from commit 73534de04176394f72d0ca0e7990bf03addcf5ea)
Add support to get phase detect residency through debugfs Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Add support to get phase detect residency information on SMUv13.0.6 SOCs. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Sysfs node disable query error count during gpu reset. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Stanley.Yang <[email protected]>
Cache the PCI state before bus master is disabled. The saved state is later used for other cases like restoring config space after mode-2 reset. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Fixes: 5c03e58 ("drm/amdgpu:add smu mode1/2 support for aldebaran")
To align with firmware, hbm id field 0x1 refers to hbm stack 0, 0x2 refers to hbm statck 1. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>
Data abort exception and unknown errors are supported. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>
In the convenience of calling it globally. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Return RMA status without message print. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Instead of printing GPU reset failed. v2: add check for reset_context->src. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
KIQ timeouts no longer seen. This reverts commit b4d12cc00ad69e8a0dea2ece7202bacfd8b894fb. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Zhigang Luo <[email protected]>
VFs do not perform HW fini/suspend in FLR, so the dpm_enabled is incorrectly kept enabled. Add interface to disable it in virt_pre_reset call. v2: Made implementation generic for all asics v3: Re-order conditionals so PP_MP1_STATE_FLR is only evaluated on VF Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Lijo Lazar <[email protected]>
Register access from userspace should be blocked until reset is complete. Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Alex Deucher <[email protected]>
Stop waiting for the KIQ to return back when there is a reset pending. It's quite likely that the KIQ will never response. Signed-off-by: Koenig Christian <[email protected]> Suggested-by: Lazar Lijo <[email protected]> Tested-by: Victor Skvortsov <[email protected]> Signed-off-by: Victor Skvortsov <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
[background] when unloading amdgpu driver right after running a workload, drain_workqueue is causing "Fence fallback timer expired on ring sdma0.0". Under sriov, this issue will cause sriov full access timeout and a reset happening. move drain_workqueue before shutdown is set to allow ih process and before enter full access under sriov to avoid full access time cost. Signed-off-by: Victor Zhao <[email protected]> Reviewed-by: Feifei Xu <[email protected]>
Driver mode-2 is only supported by relative new smc firmware. Signed-off-by: Hawking Zhang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>
This was only used as workaround for recovering the page tables after VRAM was lost and is no longer necessary after the function amdgpu_vm_bo_reset_state_machine() started to do the same. Compute never used shadows either, so the only proplematic case left is SVM and that is most likely not recoverable in any way when VRAM is lost. Signed-off-by: Christian König <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Query pmfw version to determine if aux sos fw needs to be loaded in psp v13.0. v2: refine callback to check if aux_fw loading is needed instead of getting pmfw version barely v3: return the comparison directly Signed-off-by: Le Ma <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
To be compatible with legacy IFWI, driver needs to carry legacy tOS and query pmfw version to load them accordingly. Add psp_firmware_header_v2_1 to handle the combined sos binary. Double the sos count limit for the case of aux sos fw packed. v2: pass the correct fw_bin_desc to parse_sos_bin_descriptor Signed-off-by: Le Ma <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
…ments It's caused by v6.9-rc6-1554-g8a0a7b98d4b6 drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2 Signed-off-by: Bob Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
In multiple GPUs case, after a GPU has started resetting all GPUs on hive, other GPUs do not need to trigger GPU reset again. Signed-off-by: YiPeng Chai <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
…d_acquire() In function amd_acquire(), kfd_get_process() is call to get process. When judge whether we get an exception pointer, we shouldn't judge whether it's a null pointer, because kfd_get_process will return ERR_PTR(-EINVAL) instead of null pointer if error. Because of this wrong logic, the kernel will panic then once kfd_get_process() returns ERR_PTR(-EINVAL). So, the correct logic should be: if (IS_ERR(p)) { Fixes: commit 779b4d0("drm/amdkfd: Add RDMA and PeerDirect support") Signed-off-by: Cruz Zhao <[email protected]>
Looks good. We will merge it soon. |
Is the patch submitted to upstream? I couldn't find it at: |
PeerDirect isn't upstreamable, so it cannot go to amd-staging-drm-next (which is what the amd-gfx mailing list is for). You'll notice that kfd_peerdirect.c isn't even present in the amd-staging-drm-next branch. However, I can confirm that this patch was picked internally, and will be in the upcoming ROCm release |
In function amd_acquire(), kfd_get_process() is call to get process. When judge whether we get an exception pointer, we shouldn't judge whether it's a null pointer, because kfd_get_process will return ERR_PTR(-EINVAL) instead of null pointer if error.
Because of this wrong logic, the kernel will panic then once kfd_get_process() returns ERR_PTR(-EINVAL).
So, the correct logic should be:
if (IS_ERR(p)) {
Fixes: commit 779b4d0("drm/amdkfd: Add RDMA and PeerDirect support")
fixes: #175