jiangliu
Contributor

@jiangliu jiangliu commented Jan 3, 2025

When testing suspend/resume with an AMDGPU device on bare-metal servers, the device fails to resume on the third attempt. Fix the issue by resetting the ASIC when needed.

Al Viro and others added 30 commits September 25, 2024 22:04
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
It's caused by eeab2428df5a886d40c08f9847ea4d2c2fbce7e0
"drm/amdgpu: fix a race in kfd_mem_export_dmabuf()"

Signed-off-by: Bob Zhou <[email protected]>
Reviewed-by: Asher Song <[email protected]>
If the buddy manager has more than one root and each root has sub-blocks
that need to be freed, then when drm_buddy_fini is called, the first round
of force_merge merges and frees all sub-blocks of the first root, whose
offset is 0x0 and whose size is the largest (more than half of the mm size).
In subsequent force_merge rounds, if we use 0 as the start and the remaining
mm size as the end, the blocks of the other roots are skipped in the
__force_merge function, so the other roots cannot be freed.

Solution: use each root's offset as the start.
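
A minimal Python sketch of the range problem described above. It is a simplified model, not the kernel code: the helper names (`root_layout`, `merge_ranges`, `covered`) are hypothetical, and it assumes roots are laid out largest first at increasing offsets and that each round scans `[start, start + remaining size)`.

```python
def root_layout(size, chunk):
    """Split `size` into power-of-two roots, largest first (assumed layout)."""
    roots, offset, remaining = [], 0, size
    while remaining >= chunk:
        root_size = 1 << (remaining.bit_length() - 1)  # largest power of two <= remaining
        roots.append((offset, root_size))
        offset += root_size
        remaining -= root_size
    return roots


def merge_ranges(size, chunk, use_root_offset):
    """Return the (start, end) range scanned in each force-merge round."""
    ranges, remaining = [], size
    for offset, root_size in root_layout(size, chunk):
        start = offset if use_root_offset else 0
        ranges.append((start, start + remaining))
        remaining -= root_size
    return ranges


def covered(size, chunk, use_root_offset):
    """For each root, report whether its round's range fully contains it."""
    roots = root_layout(size, chunk)
    ranges = merge_ranges(size, chunk, use_root_offset)
    return [start <= off and off + sz <= end
            for (off, sz), (start, end) in zip(roots, ranges)]
```

With a 7-chunk mm (roots of 4, 2 and 1 chunks), starting every round at 0 only covers the first root in this model, while starting at each root's offset covers all three.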

Signed-off-by: Lin.Cao <[email protected]>
Reviewed-by: Arunpravin Paneer Selvam <[email protected]>
This resolves the unchecked return value warning reported by Coverity.

Signed-off-by: Tim Huang <[email protected]>
Reviewed-by: Jesse Zhang <[email protected]>
Currently, the code uses the IH_VMID_X_LUT register to map
a queue's vmid to the corresponding PASID. This logic is racy
since CP can update the VMID-PASID mapping anytime especially
when there are more processes than number of vmids. Update the
logic to calculate CU occupancy by matching doorbell offset of
the queue with valid wave counts against the process's queues.
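
The matching step described above can be sketched roughly as follows. This is an illustrative Python model only: the function name and the dictionary-based inputs are assumptions, not the driver's data structures.

```python
def process_wave_count(process_doorbells, queue_wave_counts):
    """Sum wave counts of queues whose doorbell offset belongs to the process.

    process_doorbells: set of doorbell offsets owned by the process's queues.
    queue_wave_counts: mapping of doorbell offset -> wave count read for that queue.
    """
    return sum(waves
               for doorbell, waves in queue_wave_counts.items()
               if waves > 0 and doorbell in process_doorbells)
```

Because the lookup key is the doorbell offset rather than a VMID, the result in this model does not depend on a VMID-PASID mapping that may change while it is being read.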

Signed-off-by: Mukul Joshi <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Make CU occupancy calculations work on GFX 9.4.3 by
updating the logic to handle multiple XCCs correctly.

Signed-off-by: Mukul Joshi <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
v1 - remove cs parse code (Christian)

On VCN v4_0_6 AV1 is supported on both the instances.
Remove cs IB parse code since explicit handling of AV1 schedule is
not required.

Signed-off-by: Saleemkhan Jamadar <[email protected]>
Reviewed-by: Leo Liu <[email protected]>
Move the reinitialization part after a reset to another function. No
functional changes.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
To handle the amdgpu_device reference for different GPUs,
we add its reference to each ip block, which can be
used to differentiate between different GPU devices.

Signed-off-by: Sunil Khatri <[email protected]>
Suggested-by: Christian König <[email protected]>
Reviewed-by: Christian König <[email protected]>
program SDMAx_QUEUEx_SCHEDULE_CNTL for context switch due to
quantum in KFD for GFX12.

Signed-off-by: Sreekant Somasekharan <[email protected]>
Reviewed-by: Harish Kasiviswanathan <[email protected]>
Need to set the pipe reset and cache invalidation bits
on halt otherwise we can get stale state if the CP firmware
changes (e.g., on module unload and reload).

Reviewed-by: Srinivasan Shanmugam <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
There are some misspellings of 'access' as 'acccess' in comments;
they should be 'access'.

And the comment style should be like this:
 /*
  * Text
  * Text
  */

Suggested-by: Christian König <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Acked-by: Thomas Zimmermann <[email protected]>
Link: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Christian König <[email protected]>
Signed-off-by: WangYuli <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Add helper to get supported/available partition config modes

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Add a callback interface to get the resource information of a partition
mode. Presently the information has number of resources and number of
entities sharing the resource.

Add the implementation for aquavanjaram SOCs.

Signed-off-by: Lijo Lazar <[email protected]>
Signed-off-by: Asad Kamal <[email protected]>
Reviewed-by: Hawking Zhang <[email protected]>
Need to make sure it's halted as we don't know what state
the GPU may have been left in previously.

Reviewed-by: Srinivasan Shanmugam <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Fix several copypaste mistakes in *_disable_link_output() functions where
an improper function pointer is checked before dereference.

Found by Linux Verification Center (linuxtesting.org) with Svace.

Signed-off-by: Vitaliy Shevtsov <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
In some cases, device needs to be reset before first use. Add handlers
for doing device reset during driver init sequence.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Feifei Xu <[email protected]>
Acked-by: Rajneesh Bhardwaj <[email protected]>
Tested-by: Rajneesh Bhardwaj <[email protected]>
Without setting the dcc bit, there is random PTE copy corruption on sdma 7.

So add this bit and update the packet format accordingly.

Signed-off-by: Frank Min <[email protected]>
Reviewed-by: Christian König <[email protected]>
SR-IOV fetches the vbios from VRAM in some cases.
Re-enable the VRAM path for dGPUs and rename the function
to make it clear that it is not IGP specific.

Fixes: 042658d ("drm/amdgpu: clean up vbios fetching code")
Reviewed-by: Yang Wang <[email protected]>
Tested-by: Yang Wang <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
… driver

Root cause: Parameter type for amdgpu_amdkfd_free_gtt_mem() has changed.
Fix: Change SPM code to match this change.

Signed-off-by: Bing Ma <[email protected]>
Reviewed-by: James Zhu <[email protected]>
We should start SPM only after all SPM configurations are done, otherwise
we might see garbage data or other undefined behaviors. Because user mode
module (profiler) is responsible for SPM configurations, we will let user
mode module to start SPM.

Signed-off-by: Bing Ma <[email protected]>
Acked-by: James Zhu <[email protected]>
When SPM is reset, RLC automatically resets wptr to 0. We need to manually
reset rptr to match this.
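
A small Python model of why the read pointer has to follow the write pointer on reset. The `SpmRing` class and its method names are hypothetical; the only behaviour taken from the text is that a reset puts wptr back to 0.

```python
class SpmRing:
    """Toy ring buffer with a hardware write pointer and a driver read pointer."""

    def __init__(self, size):
        self.size = size
        self.rptr = 0
        self.wptr = 0

    def unread(self):
        """Bytes the reader believes are available, with wrap-around."""
        return (self.wptr - self.rptr) % self.size

    def hw_reset(self):
        """Models the hardware side of a reset: only wptr goes back to 0."""
        self.wptr = 0

    def driver_reset(self):
        """Models the fix: reset rptr together with wptr."""
        self.hw_reset()
        self.rptr = 0
```

If only `hw_reset()` runs while rptr is non-zero, `unread()` reports data that was never written after the reset; `driver_reset()` keeps the two pointers consistent.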

Signed-off-by: Bing Ma <[email protected]>
Reviewed-by: James Zhu <[email protected]>
…ue'.

We cannot set rptr = wptr here, because wptr is always set at segment
boundary and profiler uses this knowledge to parse SPM counters. But rptr
is not always set at segment boundary, and if we force 'rptr = wptr', we
might leave an incomplete segment to user mode profiler and profiler won't
be able to parse the counter properly.

Signed-off-by: Bing Ma <[email protected]>
Acked-by: James Zhu <[email protected]>
…) into one function spm_update_dest_info()

The gap between the two functions will trigger unnecessary data loss
condition in kfd_spm_read_ring_buffer().

Signed-off-by: Bing Ma <[email protected]>
Reviewed-by: James Zhu <[email protected]>
amdgpu_device_ip_is_idle is unused.
It was renamed from 'amdgpu_is_idle' which was originally added in
commit 5dbbb60 ("drm/amdgpu: add IP helpers for wait_for_idle and is_idle")

but hasn't been used.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
amdgpu_atpx_dgpu_req_power_for_displays has been unused since
commit bdb1ccb ("drm/amdgpu: remove ATPX_DGPU_REQ_POWER_FOR_DISPLAYS
check when hotplug-in")

amdgpu_atpx_get_dhandle has been unused since commit
f9b7f37 ("drm/amdgpu/acpi: make ATPX/ATCS structures global (v2)")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Nicholas Kazlauskas and others added 24 commits December 6, 2024 09:46
[Why]
SEC_CNTL isn't readable by x86 and can block Z8 entry if read.

[How]
Remove the read.

Signed-off-by: Nicholas Kazlauskas <[email protected]>
[Why]
In a compliance test with a configuration of three Asus ProArt MST
displays, a couple of displays were enumerated with sideband messages in
parallel, resulting in sideband message timeouts.

[How]
Force MST blocked discovery for the Asus ProArt monitor by adding a
monitor patch.

[CLEANED] Adding flag for forced MST blocked discovery

[Why]
Need a flag to force MST blocked discovery for certain branch devices.

[How]
Added a flag to force MST blocked discovery in struct dc_panel_patch.

Signed-off-by: Meenakshikumar Somasundaram <[email protected]>
This cherry-picks the following commit to 24.30:
	- 03f2fd76f82d7e6fe52a71e11d7dedca5f46d1db

[why & how]
There are cases where an OTG is remapped from driving a regular HDMI
display to a DP/eDP display. There are also cases where DTBCLK needs to be
enabled for HPO, but DTBCLK DTO programming may be done while OTG is still
enabled which is dangerous as the PIPE_DTO_SRC_SEL programming may change
the pixel clock generator source for a mapped and running OTG and cause it
to hang.

Remove the PIPE_DTO_SRC_SEL programming from this sequence since it is
already done in program_pixel_clk(). Additionally, make sure that
program_pixel_clk sets DTBCLK DTO as source for special HDMI cases.

Signed-off-by: Ovidiu Bunea <[email protected]>
[Why]
Without a head pipe check during the DML-to-DC hw mapping, illegal pipe usage is allowed. This
results in a wrong pipe topology that corrupts the MPCC tree and then causes a system TDR.

[How]
Avoid using a pipe that is a head pipe in all checks, and avoid ODM slices during the preferred pipe check.

Signed-off-by: Yihan Zhu <[email protected]>
[WHY & HOW]
Added pipe type check for DPP pipe type before executing head pipe check
in the pipe selection logic in DML2 to avoid NULL pointer de-reference.

Signed-off-by: Yihan Zhu <[email protected]>
… odm

[Why]
On some cards when odm is used, the monitor will have 2 separate pipes split
vertically. When compression is used on the YCbCr colour space on the second pipe to have
correct colours, we need to read a pixel from the end of first pipe to
accurately display colours. Hardware was programmed properly to account
for this extra pixel but it was not calculated properly in software
causing a split screen on some monitors.

[How]
The fix adjusts the second pipe's viewport and timings if the pixel encoding is
YCbCr422 or YCbCr420.

Signed-off-by: Peterson <[email protected]>
[Why & How]
According to the HDMI 2.1 specification, the GCP is not required
to be transmitted when the video stream is compressed,
so skip it.

Signed-off-by: Zhikai Zhai <[email protected]>
Cherry-pick from ec0ab6b5.

[Why]
The minimum value of dst_y_prefetch_equ was not correct
in the prefetch calculation, which causes OPTC underflow.

[How]
Add the min operation of dst_y_prefetch_equ in prefetch calculation.

Signed-off-by: loanchen <[email protected]>
Signed-off-by: Hugo Hu <[email protected]>
[WHY & HOW]
Cursor corruption is observed on a USB-C display only when the system is set up with Eyefinity and rebooted. Cursor memory might still
be in the light-sleep state due to a voltage issue, so we need to program DISPCLK_R_GATE_DISABLE to avoid this issue, on DCN35 only.

port commit id: f7101578789

Signed-off-by: Yihan Zhu <[email protected]>
Signed-off-by: Fudong Wang <[email protected]>
[why & how]
Garbage is shown because the DIG is on, so the stream needs to be blanked.

port commit id: 1e8f0e34c300

Signed-off-by: Fudongwang <[email protected]>
[Why]
In commit_planes_and_stream_update_with_new_context, it is possible to
encounter a scenario where a plane with only one reference is retrieved
in dc_plane_get_plane_configs, then this plane is released during the
minimal transition, then in reconfigure_hwfq there's a crash due to the
plane's memory having been freed. Since dc_plane_get_plane_configs is
creating new references to existing planes, the planes should be retained
and released accordingly, which should resolve the issue above. However,
doing so exposed another issue, the backup / restore planes mechanism
doesn't maintain current refcount, which can also cause crashes if the
refcount changes in between backup and restore operations.

[How]
 - retain planes in dc_plane_get_plane_configs
 - add new function dc_plane_release_plane_configs
 - release planes in callers of dc_plane_get_plane_configs where needed
 - cache and re-apply current refcount when restoring planes
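
The retain/release pairing from the list above can be modelled in a few lines of Python. This is a sketch under stated assumptions: the `Plane` class and the two helper functions are hypothetical stand-ins, not the DC structures or functions named in the commit.

```python
class Plane:
    """Toy reference-counted object; freed when the count reaches zero."""

    def __init__(self):
        self.refcount = 1
        self.freed = False

    def retain(self):
        if self.freed:
            raise RuntimeError("retain on freed plane")
        self.refcount += 1

    def release(self):
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True


def get_plane_configs(planes):
    """Hand out references to existing planes, retaining each one."""
    for plane in planes:
        plane.retain()
    return list(planes)


def release_plane_configs(configs):
    """Drop the references taken by get_plane_configs."""
    for plane in configs:
        plane.release()
```

In this model a plane that starts with a single reference survives its original owner's `release()` as long as the configs hold their own reference, and is freed only after `release_plane_configs()`.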

Signed-off-by: Joshua Aberback <[email protected]>
[WHY&HOW]
Hardware does not support the VTotal to be between fp2 lines of the
maximum possible VTotal, so add a capability flag to track it and apply
where necessary.

Signed-off-by: Dillon Varone <[email protected]>
[Why]
SPL code forces taps to 1 when ratio is 1:1 and sharpness is off
But for chroma 1:1, need taps > 1 to handle cositing
EASF only applies to luma.  Previously was checking both
 luma and chroma taps to determine whether to enable EASF

[How]
Do not force chroma taps to 1 when ratio is 1:1 for YUV420
Remove 420_CHROMA_BYPASS mode for scaler
Only check if luma taps are supported before determine
 whether to enable EASF or not

Signed-off-by: Samson Tam <[email protected]>
[WHY]

Soft hang/lag observed during 10bit playback + moving cursor, corruption
observed in other tickets for same reason, also failing MPO.

1. Currently, we are always running
   calculate_lowest_supported_state_for_temp_read which is only
   necessary on dGPU
2. Fast validate path does not apply DET buffer allocation policy
3. Prefetch UrgBFactor chroma parameter not populated in prefetch
   calculation

[HOW]
1. Add a check to see if we are on APU, if so, skip the code
2. Add det buffer alloc policy checks to fast validate path
3. Populate UrgentBurstChroma param in call to calculate
   UrgBChroma prefetch values

-revision commits: small formatting/brackets/null check addition + remove test change + dGPU code

Signed-off-by: Ausef Yousof <[email protected]>
Signed-off-by: loanchen <[email protected]>
"build failure after merge of the amdgpu tree"
The dm_suspend/dm_resume function argument mismatch was
not caught in validation because it was under the config
CONFIG_DEBUG_KERNEL_DC, which wasn't enabled by
default.

Change argument from adev to ip_block.

Signed-off-by: Sunil Khatri <[email protected]>
Acked-by: Christian König <[email protected]>
(cherry picked from commit 0d1c554f95e555bc5164e73caafb3f2b243c87e2)
remove the duplicate ip_block object in the
isp_hw_init function.

Signed-off-by: Sunil Khatri <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
(cherry picked from commit be64d3a4cbc63cc5e38c56392c04ae207c7c153c)
Split resume into a 3rd step to handle displays when DCC is
enabled on DCN 4.0.1.  Move display after the buffer funcs
have been re-enabled so that the GPU will do the move and
properly set the DCC metadata for DCN.

v2: fix fence irq resume ordering

Reviewed-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 32a8dc1b252b9066366a27d3d9ce571278212418)
Under sriov, host driver will save and restore vf pci cfg space during
reset. And during device init, under sriov, pci_restore_state happens after
fullaccess released, and it can have race condition with mmio protection
enable from host side leading to missing interrupts.

So skip amdgpu_device_cache_pci_state for sriov.

Signed-off-by: Victor Zhao <[email protected]>
Acked-by: Lijo Lazar <[email protected]>
In a consecutive packet submission, for example unmap followed by query
status, CP reads wptr when the unmap packet rings the doorbell. If CP
operates slowly in some cases (e.g. doorbell_mode=1) and wptr has already
been updated to the next packet (query status) while the query status packet
content has not been flushed to memory yet, CP will fetch stale data.

Add an mb to ensure the ring buffer has been updated before updating wptr.
Also add an mb to ensure wptr is updated before the doorbell ring.

Signed-off-by: Victor Zhao <[email protected]>
Reviewed-by: Christian König <[email protected]>
In SRIOV, when the host driver performs a MODE 1 reset and notifies the
guest driver of the FLR, there is a small chance that no job is running on
the hw but the driver has not updated the pending list yet, causing the
driver not to respond to the FLR request. Modify the has_job_running
function to check whether a job is actually still running.

v2: Use amdgpu_fence_count_emitted to determine job running status.
v3: Remove the timeout wait in has_job_running

Signed-off-by: Emily Deng <[email protected]>
Signed-off-by: Shikang Fan <[email protected]>
Reviewed-by: Christian König <[email protected]>
Some boards use longer File Ids.

Signed-off-by: Lijo Lazar <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
…ers via mes.

The current code uses the address "adev->mes.read_val_ptr" to
store the value read from a register via mes.
So when multiple threads read registers,
they all have to share this one address
and overwrite each other's values.

Assign an address with "amdgpu_device_wb_get" to store the register value,
so that each thread has its own address for the register value.
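
A rough Python model of the per-reader slot idea. It does not use the driver's API: `WritebackPool`, `fake_firmware_read` and `read_many` are hypothetical names, and the "register value" is just `reg * 2` so the result is easy to check.

```python
import threading


class WritebackPool:
    """Toy allocator handing out private result slots, guarded by a lock."""

    def __init__(self, num_slots):
        self._lock = threading.Lock()
        self._free = list(range(num_slots))
        self.memory = [0] * num_slots

    def get(self):
        with self._lock:
            if not self._free:
                raise RuntimeError("no free writeback slot")
            return self._free.pop()

    def free(self, slot):
        with self._lock:
            self._free.append(slot)


def fake_firmware_read(pool, slot, reg):
    """Stand-in for the firmware writing a register value into the slot."""
    pool.memory[slot] = reg * 2


def read_reg(pool, reg):
    """Each reader uses its own slot, so concurrent reads cannot clobber it."""
    slot = pool.get()
    try:
        fake_firmware_read(pool, slot, reg)
        return pool.memory[slot]
    finally:
        pool.free(slot)


def read_many(regs):
    """Read several registers from separate threads and collect the results."""
    pool = WritebackPool(len(regs))
    results = {}

    def worker(reg):
        results[reg] = read_reg(pool, reg)

    threads = [threading.Thread(target=worker, args=(r,)) for r in regs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a single shared slot, two readers could see each other's value; giving every reader a private slot for the duration of the read removes that sharing in this model.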

Signed-off-by: chongli2 <[email protected]>
Reviewed-by: Emily Deng <[email protected]>
Reviewed-by: Christian König <[email protected]>
When GPU suspend is aborted, do the same for dGPU as APU to reset
soc15 asic. Otherwise it may cause following errors:
[  547.229463] amdgpu 0001:81:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)

[  555.126827] amdgpu 0000:0a:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[  555.126901] [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
[  555.126957] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_4_3> failed -110
[  555.126959] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_resume failed (-110).
[  555.126965] PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -110
[  555.126966] PM: Device 0000:0a:00.0 failed to resume async: error -110

Signed-off-by: Jiang Liu <[email protected]>
Tested-by: Shuo Liu <[email protected]>
commit 9cef84b
drm/amdgpu: update suspend status for aborting from deeper suspend

Besides executing the _S3 method, there are some other suspend abort
cases which can call the noirq suspend. Those cases need to be
processed as an incomplete suspension.

Signed-off-by: Jiang Liu <[email protected]>
@superm1
Contributor

superm1 commented Jan 9, 2025

Can you please bring these to the amd-gfx M/L? Kernel patches are reviewed there.

@jiangliu
Contributor Author

> Can you please bring these to the amd-gfx M/L? Kernel patches are reviewed there.

Sure, will do that.
This patchset only applies to this repo and conflicts with amd-staging-drm-next. Would it still work to send it to the amd-gfx mailing list?

@superm1
Contributor

superm1 commented Jan 10, 2025

Can you please rebase and adjust conflicts on AMD staging drm next?

This is the way all new changes start. We can do a backport to the dkms and other branches after it's landed.

@jiangliu
Contributor Author

> Can you please rebase and adjust conflicts on AMD staging drm next?
>
> This is the way all new changes start. We can do a backport to the dkms and other branches after it's landed.

The code logic is different in these two repos, and this change only applies to this repo. The amd-staging-drm-next repo has a different code base here, so I can't rebase onto it :(

@superm1
Contributor

superm1 commented Jan 10, 2025

Yes, the logic changed in newer kernels. I believe the specific commit in question is torvalds/linux@d5e3d8a.

If you port that to this branch does it work properly?
