
Increase private segment limit for dynamic scratch kernels #80

Open
wants to merge 91 commits into base: develop
Conversation

publixsubfan

Check dynamic scratch allocations against the per-wave scratch limit defined in ROCR-Runtime, which is set to 8 MB per wave. This raises the original per-thread private segment limit from 16 KiB to almost 128 KiB.

A reproducer that demonstrates this issue is available here: https://gist.github.com/publixsubfan/f5fcfe9f3d826c45a80e23acf5d88de2

Details

The ROCR-Runtime repo defines the following limit for per-wave scratch memory:
https://github.com/ROCm/ROCR-Runtime/blob/master/src/core/runtime/amd_gpu_agent.cpp#L84

#define MAX_WAVE_SCRATCH 8387584  // See COMPUTE_TMPRING_SIZE.WAVESIZE

Likewise, LLVM specifies the following scratch limits here: https://github.com/ROCm/llvm-project/blob/d0f9aa6415cde2f7b9bc6dbf385b5c77b700edec/llvm/lib/Target/AMDGPU/GCNSubtarget.h#L308-L320

  • (2^13 - 1) * 1 KiB for GFX10 and below; this matches the MAX_WAVE_SCRATCH value
  • (2^15 - 1) * 256 B for GFX11; this is slightly larger than MAX_WAVE_SCRATCH
  • (2^18 - 1) * 256 B for GFX12 and above, which is a little under 64 MB per wave
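
To illustrate the kind of check this change enables, here is a minimal C++ sketch, assuming a wavefront size of 64 and the MAX_WAVE_SCRATCH value quoted above; the helper name and parameters are hypothetical and not the actual ROCclr code:

    #include <cstddef>
    #include <cstdint>

    // Per-wave scratch limit from ROCR-Runtime (see COMPUTE_TMPRING_SIZE.WAVESIZE).
    constexpr std::size_t kMaxWaveScratch = 8387584;

    // Hypothetical helper: does a kernel's dynamic per-thread private segment
    // (scratch) request fit under the per-wave limit?
    bool FitsWaveScratchLimit(std::size_t perThreadScratchBytes,
                              std::uint32_t wavefrontSize /* 32 or 64 */) {
      // Per-wave usage is the per-thread request times the wavefront size.
      const std::size_t perWaveScratch =
          perThreadScratchBytes * static_cast<std::size_t>(wavefrontSize);
      return perWaveScratch <= kMaxWaveScratch;
    }

    // Example: 8387584 / 64 = 131056 bytes per thread, which is why the
    // per-thread limit rises from 16 KiB to almost 128 KiB on wave64 hardware.
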

gandryey and others added 30 commits March 18, 2024 10:52
PAL optimized the logic for the barriers, which caused failures with CP DMA on Navi4x.
Change the barrier code to match the most recent PAL optimizations.

Change-Id: I55eeab20f51eb8e920bcbb4b55fbe3c7f77fd3fa
Change-Id: If1bcca45825c9899462bb95ed6f637f5af806cc8
This no longer does anything.

Change-Id: I0643198a46a534a76454a5b461d010ed1776a89a
Added a call to hipDeviceSynchronize in __hipUnregisterFatBinary
to ensure that all HSA async signal handlers complete before removeFatBinary.

Change-Id: I756fecca1c2a5eae092613d8079de266399e5685
… elementSize should be 1 as width is in bytes while capturing hipMemset2DAsync.

Change-Id: I8f9122a30cba0a07c097dfd7609432090caab142
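
For context, a minimal capture sequence that exercises hipMemset2DAsync is sketched below; buffer sizes are illustrative and error checking is omitted. The point is that the width argument is already a byte count, so the captured memset should use an element size of 1:

    #include <hip/hip_runtime.h>

    int main() {
      hipStream_t stream;
      hipStreamCreate(&stream);

      // 2D allocation: width is specified in bytes, height in rows.
      void* dev = nullptr;
      size_t pitch = 0;
      const size_t widthBytes = 256, height = 64;
      hipMallocPitch(&dev, &pitch, widthBytes, height);

      // Capture a 2D memset into a graph; the width is a byte count.
      hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
      hipMemset2DAsync(dev, pitch, 0, widthBytes, height, stream);
      hipGraph_t graph;
      hipStreamEndCapture(stream, &graph);

      hipGraphDestroy(graph);
      hipFree(dev);
      hipStreamDestroy(stream);
      return 0;
    }
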
This reverts commit 5f68a45.

Reason for revert: the Windows staging branch uses the OpenCL-ICD-Loader master branch.

Change-Id: I9cca7564a21de1733665a34da6f0322aa3b886e7
…function to be called at exit"

This reverts commit 5e294f8.

Change-Id: Ib9cb1cc0c3903bfba56c9a5d05ae8afe96be583a
Under ROCr, physical allocations don't have an initial VA and require an extra
flag in ROCclr. Add an option to have a mempool of physical allocations.

Change-Id: I4d062fe0dd8113d4eaf6e8b51749ed56d8701d1e
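
The change itself is internal to ROCclr, but the underlying idea, a physical allocation that has no virtual address until it is explicitly mapped, can be sketched with HIP's virtual memory management API; this is only an illustration of the concept, not the mempool implementation:

    #include <hip/hip_runtime.h>

    int main() {
      // Describe a physical allocation on device 0.
      hipMemAllocationProp prop = {};
      prop.type = hipMemAllocationTypePinned;
      prop.location.type = hipMemLocationTypeDevice;
      prop.location.id = 0;

      size_t granularity = 0;
      hipMemGetAllocationGranularity(&granularity, &prop,
                                     hipMemAllocationGranularityMinimum);
      const size_t size = granularity;  // one granule, purely illustrative

      // Physical memory only: no initial VA.
      hipMemGenericAllocationHandle_t handle;
      hipMemCreate(&handle, size, &prop, 0);

      // Reserve a VA range, map the physical allocation into it, enable access.
      void* ptr = nullptr;
      hipMemAddressReserve(&ptr, size, 0, nullptr, 0);
      hipMemMap(ptr, size, 0, handle, 0);

      hipMemAccessDesc access = {};
      access.location.type = hipMemLocationTypeDevice;
      access.location.id = 0;
      access.flags = hipMemAccessFlagsProtReadWrite;
      hipMemSetAccess(ptr, size, &access, 1);

      // ... use ptr ...

      hipMemUnmap(ptr, size);
      hipMemAddressFree(ptr, size);
      hipMemRelease(handle);
      return 0;
    }
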
Change-Id: I33d1359d5e4c871f63350d8300f726e039664d86
Change-Id: I7b58177c41dc0c6c59813977cb90e65a6cb3be72
Change-Id: Ica32017ef7b00326dfb6d1f604e126d40ad5b786
Change-Id: Ib479c744b90125b74d99cbf18b7f4b8cf765bf1c
- With https://gerrit-git.amd.com/c/lightning/ec/llvm-project/+/1002628 applied, Kernel::dynamicParallelism() returns true at -O0 but the virtual queue isn't created
- This causes a segfault inside VirtualGPU::submitKernelInternal() when getVQVirtualAddress() is called

Change-Id: Ia7af042adad2329e870c142caaac3e8fa886f8b8
- Create a vector to allow multiple timestamps (TS) to be stored in Command.
- This means we don't wait for the entire batch in the Accumulate command to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This minimum value may still need to be tuned, but the KFD allows a maximum of 4094 interrupt signals per device.
- Store kernel names whenever they are available, not just when profiling. If profiling is enabled dynamically, as with Torch, a crash can happen if hipGraphInstantiate wasn't included in the Torch profile scope, because we previously recorded kernel names only when the profiler is attached.

Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
…ead is being destroyed by the app/test's unload of libamdhip64.so.

Change-Id: I8d4a8d8b6801d9f6eb745c45adf831597def0cb5
The alwaysResident setting doesn't require per-queue residency tracking,
so the logic can be skipped to avoid locking the queues.

Change-Id: Ib5cff5b79d3ecb8c2f2eb2565cf069f9a69438b0
The new logic takes a lock around the PAL call and doesn't require the queue lock.

Change-Id: I61b67c3c4abd2ede44809de1d6beed756766032e
… during capture

=> hipDeviceSynchronize is not allowed during capture.
=> hipEventSynchronize during capture should return the hipErrorCapturedEvent error.
=> hipEventQuery during capture should return the hipErrorCapturedEvent error.

For hipStreamSynchronize, hipEventSynchronize, hipStreamWaitEvent, and hipStreamQuery:

For a side stream (a stream that is not currently under capture):
=> If the current thread is capturing in relaxed mode, calls are allowed.
=> If any stream in the current/concurrent thread is capturing in global mode, calls are not allowed.
=> If any stream in the current thread is capturing in ThreadLocal mode, calls are not allowed.

For the stream that is currently under capture:
=> Calls are not allowed.
=> Any call that is not allowed during capture invalidates the capture sequence.
=> It is invalid to call synchronous APIs during capture. Synchronous APIs, such as hipMemcpy(), enqueue work to the legacy stream and synchronize it before returning.

Change-Id: I201c6e63e1a5d93fd416a3b520264c0fdbe31237
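
A minimal sketch of the captured-event rule above, using only the APIs named in the commit; the return code shown is the expectation stated here, not verified output:

    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
      hipStream_t stream;
      hipStreamCreate(&stream);

      hipEvent_t event;
      hipEventCreate(&event);

      hipStreamBeginCapture(stream, hipStreamCaptureModeGlobal);
      hipEventRecord(event, stream);  // the event is now captured

      // Synchronizing a captured event during capture is expected to return
      // hipErrorCapturedEvent instead of blocking, and it invalidates the
      // capture sequence, so hipStreamEndCapture below may fail as well.
      hipError_t err = hipEventSynchronize(event);
      std::printf("hipEventSynchronize during capture: %s\n",
                  hipGetErrorString(err));

      hipGraph_t graph;
      hipStreamEndCapture(stream, &graph);

      hipEventDestroy(event);
      hipStreamDestroy(stream);
      return 0;
    }
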
Change-Id: Id58f982edd4f17d675f7a0f61a9b4dea0baebd9b
Change-Id: Icbe67024297c92bf59139b6a2ccd2ba3674f60b1
Add a cltrace compile definition setting CL_TARGET_OPENCL_VERSION to OpenCL 2.2.

Change-Id: Ie868ab0a6e86951afc6d07da58be942c3b736d15
When large BAR is enabled, persistent memory leads to overallocation on 32-bit architectures.

Change-Id: Iae39359d8128588de02e42d77fe58e868b8e71fd
- The application passes null for the stream parameter in the hipStreamWaitEvent API.
- When the event's stream isn't capturing and the event is not recorded, this causes a segfault because we access deviceId() from waitStream.

Change-Id: I8b87ffd6f234677f68b66dcb7ef44b2ff04a7c91
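
For reference, the call pattern described above looks roughly like this; it is a sketch of the pre-fix crash scenario, not a test from this change:

    #include <hip/hip_runtime.h>

    int main() {
      hipEvent_t event;
      hipEventCreate(&event);  // note: the event is never recorded

      // The application passes null (the default/legacy stream) as the stream.
      // With no capture in progress and an unrecorded event, the runtime
      // previously dereferenced the wait stream to read deviceId() and crashed.
      hipStreamWaitEvent(/*stream=*/nullptr, event, /*flags=*/0);

      hipEventDestroy(event);
      return 0;
    }
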
…hGenericAqlPacket

Dispatching multiple packets while ringing the doorbell only once is not supported by the lower layers.

Change-Id: I7665a2dcdd4ef9e47dadfe410180fed64c5a4ee0
Change-Id: Iadcdadd734e7aeeb23742e426353defa972d3ad5
If we are using the mask returned by getLastUsedSdmaEngine(), we need to
apply the SDMA read/write mask to it before using it with the HSA
copy_on_engine API.

Change-Id: I6e5dc6c187eeb3c61ee159e9d2a0fa7b4737c06e
This also brings the bfloat16 implementation closer to CUDA's.

Change-Id: I23f381141faacd6537923ae9b88ada4d661db496
Change-Id: Ic41b1ad1b64cca0e31986337a83a5146d52a7328
Add the kernel arguments optimization to the blit path.
Enabled by default on MI300.

Change-Id: I2694a81b90d48ad07d86dfe4c0c64fe187bada8e
…lti devices and chunk decommit gets delayed.

Change-Id: Ia4b0d5fbfa8f198776e52d14de8b22c6942f740d
gandryey and others added 25 commits April 25, 2024 11:24
Change-Id: Id23882286cb2a0d0472964ffc501ab27b7dc7f00
Signed-off-by: shadi <[email protected]>
Change-Id: Iff3ee7dcbcd24836f227fdc9bd5ff4b554ac914f
Change-Id: Ie3f3c0bcea84368c1b0607fd52b4bc7cae41c512
Make sure graph mempool unmaps VA on release

Change-Id: Id3f1bd8d0115b533ae60aa5ba3676b8bf7e5b961
…a single virtual address.

Change-Id: Ie678e607a64f2e5c35a10b9083185f041c5527ac
The .bat file will not be removed from the Windows package.
Also use cmake install(PROGRAMS …), which sets the correct permissions rather than hard-coding them.

Change-Id: I8b57778b59f70e01de949be2ea353b67eec70d2f
Change-Id: Iedd6290a813d6e43a4350709484f78e05b08adc8
Change-Id: I1f8411eae9ed49632667e244a25f223fed92c720
Change-Id: I7fda94e61121f9d3a30f4ad185b8a97712922f3c
Change-Id: Ief73138faed0af70b90186db5bde6689e0a83f88
…ss the flag to HSA API.

Change-Id: I1bafeaa3096395c729723af958d609bc41e7845c
- Add LOG_TS mask for printing signal times
- Read raw ticks from signals

Change-Id: Ibdd0bf06c790729f6c65083a4784c97a3c3219e0
- Fix possible buffer overflow for long kernel names

Change-Id: I3c51669de7ff242d03f9210ee045b6d5e7ac274a
Change-Id: I820cacd75a36363d1387e2e881c96937281bc265
Change-Id: Id3cb4b58b9efd3aceec4377d4d1d3a053c50333e
Change-Id: I3601373f680aa4bff0075f4b3b9e885e54b4600d
1. Make the runtime use comgr to unbundle code objects
2. Support compressed/uncompressed modes
3. Remove HIP_USE_RUNTIME_UNBUNDLER and HIPRTC_USE_RUNTIME_UNBUNDLER to simplify the logic
4. Add a comgr wrapper for amd_comgr_action_info_set_bundle_entry_ids()

Change-Id: Ic41b1ad1b64cca1e31986437983a5146d52a7329
The Readback and Avoid HDP Flush memory-ordering workaround is used
as a fallback solution only when the HDP flush register is invalid.

Change-Id: Ic284eba1f95ed22b0270d3abeb904fb902015b1a
Unbundling compressed code objects needs comgr 2.8 or higher.

Change-Id: I23942d2038e19b02c3ea5d3c9c1fe5367db87136
…ltiple translation unit.

Change-Id: I92179ad198abbdaf5aec9c3c4ba76eeb6b0cc761
Change-Id: I21abe109ddfabfe7640bf78a96c81a1317d31952
Change-Id: Iabb4071bb77201576bc2c0488a04f4fa188815df
Change-Id: I374ea7c3119b0c61f9846a862c4a448ddb179748
Switch command creation to the new suballocator to avoid
frequent, expensive OS calls.

Change-Id: I3597c811820e577c15708bad8b8a41aa53acc400