Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drm/amdkfd: fix the incorrect exception handling logic in function amd_acquire() #176

Open
wants to merge 2,469 commits into
base: master
Choose a base branch
from

Conversation

zxpdemonio
Copy link

@zxpdemonio zxpdemonio commented Oct 31, 2024

In function amd_acquire(), kfd_get_process() is call to get process. When judge whether we get an exception pointer, we shouldn't judge whether it's a null pointer, because kfd_get_process will return ERR_PTR(-EINVAL) instead of null pointer if error.

Because of this wrong logic, the kernel will panic then once kfd_get_process() returns ERR_PTR(-EINVAL).

So, the correct logic should be:
if (IS_ERR(p)) {

Fixes: commit 779b4d0("drm/amdkfd: Add RDMA and PeerDirect support")
fixes: #175

srishanm and others added 30 commits May 20, 2024 09:56
The parameters segment_width and last_segment_width are used to control
the configuration of the Output Plane Processor (OPP), specifically the
width of each segment that the display is divided into and the width of
the last segment

Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'segment_width' not described in 'optc35_set_odm_combine'
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'last_segment_width' not described in 'optc35_set_odm_combine'
drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Excess function parameter 'timing' description in 'optc35_set_odm_combine'

Cc: Tom Chung <[email protected]>
Cc: Rodrigo Siqueira <[email protected]>
Cc: Roman Li <[email protected]>
Cc: Aurabindo Pillai <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Tom Chung <[email protected]>
Align with new port same as smu 13.x.

Signed-off-by: Kenneth Feng <[email protected]>
Reviewed-by: Jack Gui <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Update the capabilities for supporting 8k encoding.

Reviewed-by: David (Ming Qiang) Wu <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Ruijing Dong <[email protected]>
The following commit updated gmc->noretry from 0 to 1 for GC HW IP
9.3.0:

    commit 5f3854f ("drm/amdgpu: add more cases to noretry=1")

This causes the device to hang when a page fault occurs, until the
device is rebooted. Instead, revert back to gmc->noretry=0 so the device
is still responsive.

Fixes: 5f3854f ("drm/amdgpu: add more cases to noretry=1")
Signed-off-by: Tim Van Patten <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
./drivers/gpu/drm/amd/amdgpu/amdgpu.h: amdgpu_umsch_mm.h is included more than once.

Reported-by: Abaci Robot <[email protected]>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9063
Signed-off-by: Jiapeng Chong <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
…ing_set_wptr

This commit removes a duplicate check for *is_queue_unmap in the
sdma_v7_0_ring_set_wptr function. The check at line 171 was considered
dead code because at this point in the code, we already know that
*is_queue_unmap is false due to the check at line 161.

By removing this unnecessary check, improves the readability of the
code

Fixes the below:
	drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c:171 sdma_v7_0_ring_set_wptr()
	warn: duplicate check '*is_queue_unmap' (previous on line 161)

drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
    140 static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring)
    141 {
    142         struct amdgpu_device *adev = ring->adev;
    143         uint32_t *wptr_saved;
    144         uint32_t *is_queue_unmap;
    145         uint64_t aggregated_db_index;
    146         uint32_t mqd_size = adev->mqds[AMDGPU_HW_IP_DMA].mqd_size;
    147
    148         DRM_DEBUG("Setting write pointer\n");
    149
    150         if (ring->is_mes_queue) {
    151                 wptr_saved = (uint32_t *)(ring->mqd_ptr + mqd_size);
    152                 is_queue_unmap = (uint32_t *)(ring->mqd_ptr + mqd_size +
                        ^^^^^^^^^^^^^^^^ Set here

    153                                               sizeof(uint32_t));
    154                 aggregated_db_index =
    155                         amdgpu_mes_get_aggregated_doorbell_index(adev,
    156                                                          ring->hw_prio);
    157
    158                 atomic64_set((atomic64_t *)ring->wptr_cpu_addr,
    159                              ring->wptr << 2);
    160                 *wptr_saved = ring->wptr << 2;
    161                 if (*is_queue_unmap) {
                            ^^^^^^^^^^^^^^^ Checked here

    162                         WDOORBELL64(aggregated_db_index, ring->wptr << 2);
    163                         DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n",
    164                                         ring->doorbell_index, ring->wptr << 2);
    165                         WDOORBELL64(ring->doorbell_index, ring->wptr << 2);
    166                 } else {
    167                         DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n",
    168                                         ring->doorbell_index, ring->wptr << 2);
    169                         WDOORBELL64(ring->doorbell_index, ring->wptr << 2);
    170
--> 171                         if (*is_queue_unmap)
                                    ^^^^^^^^^^^^^^^ This is dead code.  We know it's false.

    172                                 WDOORBELL64(aggregated_db_index,
    173                                             ring->wptr << 2);
    174                 }
    175         } else {
    176                 if (ring->use_doorbell) {
    177                         DRM_DEBUG("Using doorbell -- "
    178                                   "wptr_offs == 0x%08x "

Fixes: 6d9c711786e6 ("drm/amdgpu: Add sdma v7_0 ip block support (v7)")
Cc: Likun Gao <[email protected]>
Cc: Hawking Zhang <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Likun Gao <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
modify the lock type to 'spinlock' to avoid schedule issue
in interrupt context.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Add support to set/get information about different DPM policies. The
support is only available on SOCs which use swsmu architecture.

A DPM policy type may be defined with different levels. For example, a
policy may be defined to select Pstate preference and then later a
pstate preference may be chosen.

Signed-off-by: Lijo Lazar <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
It's caused by 7535d371a27cc788d8f41394a22e3fd492ee5d2c
"drm/amd/pm: Add support for DPM policies"

Signed-off-by: Bob Zhou <[email protected]>
Reviewed-by: Asher Song <[email protected]>
Add PMF message to select a Pstate policy in SOCs with SMU v13.0.6.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Add support to select pstate policy in SOCs with SMUv13.0.6

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Add support to set XGMI PLPD policy levels through 'pm_policy/xgmi_plpd'
sysfs node.

Signed-off-by: Lijo Lazar <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
On SOCs with SMU v13.0.6, allow changing xgmi plpd policy through
'pm_policy/xgmi_plpd' sysfs interface.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
Acked-by: Alex Deucher <[email protected]>
On aldebaran, allow changing xgmi plpd policy through
'pm_policy/xgmi_plpd' sysfs interface.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
Acked-by: Alex Deucher <[email protected]>
modify the lock type to 'spinlock' to avoid schedule issue
in interrupt context.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
On arcturus, allow changing xgmi plpd policy through
'pm_policy/xgmi_plpd' sysfs interface.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Replace the legacy interface with amdgpu_dpm_set_pm_policy to set XGMI
PLPD mode. Also, xgmi_plpd_policy sysfs node is not used by any client.
Remove that as well.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
Acked-by: Alex Deucher <[email protected]>
fix ACA no query result after gpu reset.

Signed-off-by: Yang Wang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Move dsc functions from dc.c to dc_dsc.c.

Co-Developed-by: George Shen <[email protected]>
Signed-off-by: Wenjing Liu <[email protected]>
Reviewed-by: Rodrigo Siqueira <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
The function that commits planes calls the same set of functions twice,
and in the case of the FAMs utilization, it is not desired to call the
dmub, hwss_build and hwss_execute. This commit just removes the
unnecessary calls to those functions.

Acked-by: Roman Li <[email protected]>
Signed-off-by: Rodrigo Siqueira <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
…or dcn401

The functions are missing. These two functions are required to support
MST.

Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Wenjing Liu <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
The comparisons intend to be DCN401 inclusive, and fix it by adding
equal signs.

Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Alex Hung <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
[Why]
DVI is TMDS signal like HDMI but without audio. Current signal check
does not correctly reflect DVI clock programming.

[How]
Define a new signal check for TMDS that includes DVI to HDMI TMDS
programming.

Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Chris Park <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
This fixes indentations and adjust spaces for better readability and
code styles.

Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Alex Hung <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
[Why]
Enable adaptive scaler support for DCN401

[How]
- Enable build flag for SPL
- Set prefer_easf flag to true
- Apply light linear scaling policy based on transfer function and pixel
  format.  Choose between linear or non-linear scaling
- Set matrix_mode based on pixel format
- Disable ring estimator
- Add missing EASF register defines, masks, and writes
- Disable EASF if scale ratio or number of taps is unsupported and when
  bypassing the scaler
- Add debug flags and registry keys for debugging SPL and EASF
- Add support for Visual Confirm with EASF

Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Samson Tam <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
[why]
Cleaning up the code refactor requires hubbub to be in its own
component.

[how]
Move all DCN401 files under newly created hubbub folder and fixing the
makefiles.

Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Harikrishna Revalla <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Need to select DTBCLK and DPREFCLK as DTBCLK_p source according to
hardware guidance.

Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Dillon Varone <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Acked-by: Roman Li <[email protected]>
Signed-off-by: Rodrigo Siqueira <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
update driver-if interface for smu 14.0.2/3

Signed-off-by: Kenneth Feng <[email protected]>
Reviewed-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
[Why] Coverity reports NULL_RETURN warning.

[How] Add pointer NULL check.

Reviewed-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Hersen Wu <[email protected]>
Tested-by: Daniel Wheeler <[email protected]>
Lijo Lazar and others added 26 commits July 12, 2024 14:37
Add debugfs nodes for enabling/disabling and tuning parameters used in
phase detect.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Add support for enabling phase detect and tuning params for SMUv13.0.6
SOCs.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Phase detect controls are only available for SMUv13.0.6 dGPUs. Create
control object only on those.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
v3.x changed the how vram width was encoded.  The previous
implementation actually worked correctly for most boards.
Fix the implementation to work correctly everywhere.

This fixes the vram width reported in the kernel log on
some boards.

Reviewed-by: Hawking Zhang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Limiting the allocation of higher order pages to the closest NUMA node
and enabling direct memory reclaim provides not only failsafe against
situations when memory becomes too much fragmented and the allocator is
not able to satisfy the request from the local node but falls back to
remote pages (HUGEPAGE) but also offers performance improvement.
Accessing remote pages suffers due to bandwidth limitations and could be
avoided if memory becomes defragmented and in most cases without using
manual compaction. (/proc/sys/vm/compact_memory)

Note: On certain distros such as RHEL, the proactive compaction is
disabled. (https://tinyurl.com/4f32f7rs)

v2 (chk): drop __GFP_RECLAIM since that is already set by GFP_USER

Cc: Dave Airlie <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Daniel Vetter <[email protected]>
Reviewed-by: Christian König <[email protected]>
Signed-off-by: Rajneesh Bhardwaj <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
Signed-off-by: Christian König <[email protected]>
(cherry picked from commit 73534de04176394f72d0ca0e7990bf03addcf5ea)
Add support to get phase detect residency through debugfs

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Add support to get phase detect residency information on SMUv13.0.6
SOCs.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Sysfs node disable query error count during gpu reset.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Stanley.Yang <[email protected]>
Cache the PCI state before bus master is disabled. The saved state is
later used for other cases like restoring config space after mode-2
reset.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>

Fixes: 5c03e58 ("drm/amdgpu:add smu mode1/2 support for aldebaran")
To align with firmware, hbm id field 0x1 refers to
hbm stack 0, 0x2 refers to hbm statck 1.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
Data abort exception and unknown errors are supported.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
In the convenience of calling it globally.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Return RMA status without message print.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Instead of printing GPU reset failed.

v2: add check for reset_context->src.

Signed-off-by: Tao Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
KIQ timeouts no longer seen.

This reverts commit b4d12cc00ad69e8a0dea2ece7202bacfd8b894fb.

Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed-by: Zhigang Luo <[email protected]>
VFs do not perform HW fini/suspend in FLR, so the dpm_enabled
is incorrectly kept enabled. Add interface to disable it in
virt_pre_reset call.

v2: Made implementation generic for all asics
v3: Re-order conditionals so PP_MP1_STATE_FLR is only evaluated on VF

Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Register access from userspace should be blocked until
reset is complete.

Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Stop waiting for the KIQ to return back when there is a reset pending.
It's quite likely that the KIQ will never response.

Signed-off-by: Koenig Christian <[email protected]>
Suggested-by: Lazar Lijo <[email protected]>
Tested-by: Victor Skvortsov <[email protected]>
Signed-off-by: Victor Skvortsov <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
[background] when unloading amdgpu driver right after running a
workload, drain_workqueue is causing "Fence fallback timer
expired on ring sdma0.0". Under sriov, this issue will cause sriov
full access timeout and a reset happening.

move drain_workqueue before shutdown is set to allow ih process and
before enter full access under sriov to avoid full access time cost.

Signed-off-by: Victor Zhao <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Driver mode-2 is only supported by relative new
smc firmware.

Signed-off-by: Hawking Zhang <[email protected]>
Reviewed-by: Tao Zhou <[email protected]>
This was only used as workaround for recovering the page tables after
VRAM was lost and is no longer necessary after the function
amdgpu_vm_bo_reset_state_machine() started to do the same.

Compute never used shadows either, so the only proplematic case left is
SVM and that is most likely not recoverable in any way when VRAM is
lost.

Signed-off-by: Christian König <[email protected]>
Acked-by: Lijo Lazar <[email protected]>
Query pmfw version to determine if aux sos fw needs to be loaded
in psp v13.0.

v2: refine callback to check if aux_fw loading is needed instead of
    getting pmfw version barely
v3: return the comparison directly

Signed-off-by: Le Ma <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
To be compatible with legacy IFWI, driver needs to carry legacy tOS and
query pmfw version to load them accordingly.

Add psp_firmware_header_v2_1 to handle the combined sos binary.

Double the sos count limit for the case of aux sos fw packed.

v2: pass the correct fw_bin_desc to parse_sos_bin_descriptor

Signed-off-by: Le Ma <[email protected]>
Reviewed-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
…ments

It's caused by v6.9-rc6-1554-g8a0a7b98d4b6
drm/mst: Fix NULL pointer dereference at drm_dp_add_payload_part2

Signed-off-by: Bob Zhou <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
In multiple GPUs case, after a GPU has started
resetting all GPUs on hive, other GPUs do not
need to trigger GPU reset again.

Signed-off-by: YiPeng Chai <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
…d_acquire()

In function amd_acquire(), kfd_get_process() is call to get process.
When judge whether we get an exception pointer, we shouldn't judge
whether it's a null pointer, because kfd_get_process will return
ERR_PTR(-EINVAL) instead of null pointer if error.

Because of this wrong logic, the kernel will panic then once
kfd_get_process() returns ERR_PTR(-EINVAL).

So, the correct logic should be:
	if (IS_ERR(p)) {

Fixes: commit 779b4d0("drm/amdkfd: Add RDMA and PeerDirect support")

Signed-off-by: Cruz Zhao <[email protected]>
@hkasivis
Copy link
Contributor

Looks good. We will merge it soon.

@whchung
Copy link

whchung commented Nov 19, 2024

@kentrussell
Copy link
Contributor

PeerDirect isn't upstreamable, so it cannot go to amd-staging-drm-next (which is what the amd-gfx mailing list is for). You'll notice that kfd_peerdirect.c isn't even present in the amd-staging-drm-next branch. However, I can confirm that this patch was picked internally, and will be in the upcoming ROCm release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Issue]: Incorrect exception handling logic in function amd_acquire()