[Issue]: ROCM5.7.3, RCCL2.19.4 GPU kernel can't printf。Hash value collision detected #73

Closed
yangyangv8 opened this issue Apr 9, 2024 · 5 comments
Labels
duplicate This issue or pull request already exists

Comments

@yangyangv8
yangyangv8 commented Apr 9, 2024

Problem Description

In the RCCL file prims_simple.h, I added a printf call to this device function, for example:

    __device__ __forceinline__ void genericOp(
        intptr_t srcIx, intptr_t dstIx, int nelem, bool postOp
      ) {
      constexpr int DirectRecv = /*1 &&*/ Direct && DirectRecv1;
      constexpr int DirectSend = /*1 &&*/ Direct && DirectSend1;
      constexpr int Src = SrcBuf != -1;
      constexpr int Dst = DstBuf != -1;
      nelem = nelem < 0 ? 0 : nelem;
      int sliceSize = stepSize*StepPerSlice;
      sliceSize = max(divUp(nelem, 16*SlicePerChunk)*16, sliceSize/32);
      int slice = 0;
      int offset = 0;
      if (tid == 0) {
        printf("in genericOp \n");
      }

When I run the RCCL test with the command ./build/sendrecv_perf -b 8 -e 128M -f 2 -t 1 -g 2, it reports this error:

enquence.cc Current function: ncclLaunchKernel line 1090
:1:rocvirtual.cpp :2945: 74877529363 us: [pid:44406 tid:0x7f26f4922c00] Pcie atomics not enabled, hostcall not supported
:1:rocvirtual.cpp :3280: 74877529375 us: [pid:44406 tid:0x7f26f4922c00] AQL dispatch failed!
yz-adm3: Test NCCL failure /home/yang.yang/yy/work/test-rccl/build/src/hipify/common.cu.cpp:451 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
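
The log above says that hostcall-based printf is unavailable because PCIe atomics are not enabled. One way to look at what the PCIe devices on the machine advertise is to filter the output of `lspci -vvv` for its AtomicOps fields. The following is a minimal sketch; `filter_atomic_ops` is a hypothetical helper name, and whether these fields match exactly what the ROCm runtime checks is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ROCm or pciutils): reads `lspci -vvv`
# output on stdin and keeps only the device header lines plus the
# AtomicOpsCap / AtomicOpsCtl capability lines.
filter_atomic_ops() {
  # Header lines start at column 0 with a bus address such as "03:00.0 ";
  # capability lines are indented, so they match only the second alternative.
  grep -E '^[0-9a-fA-F:.]+ |AtomicOps(Cap|Ctl):'
}

# Usage (root is needed, otherwise lspci hides the capability registers):
#   sudo lspci -vvv | filter_atomic_ops
```

A `+` after a field (for example `ReqEn+`) means the bit is set and a `-` means it is clear. The GPU and every bridge between it and the CPU appear as separate devices in the output.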

After reading the explanation at https://rocm.docs.amd.com/en/latest/about/CHANGELOG.html#non-hostcall-hip-printf, I added the following setting to the RCCL CMakeLists.txt file:

target_compile_options(rccl PRIVATE -mprintf-kind=buffered)

makefiles/common.mk:
CXXFLAGS := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden \
            -Wall -mprintf-kind=buffered -g -Wno-unused-function -Wno-sign-compare -std=c++11 -Wvla \
            -I $(CUDA_INC) \
            $(CXXFLAGS)
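
To check whether the same `-mprintf-kind=buffered` flag works outside RCCL, a standalone kernel can be built with it. The sketch below writes a small HIP program and compiles it with `hipcc`. The function names `write_printf_repro` and `build_printf_repro`, the file names, and the kernel itself are illustrative assumptions and are not taken from RCCL.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a minimal HIP program that prints from one thread.
write_printf_repro() {
  local src="$1"
  cat > "$src" <<'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>

// Only thread 0 prints, mirroring the tid == 0 guard used in the issue.
__global__ void hello() {
  if (threadIdx.x == 0) {
    printf("in kernel, block %d\n", (int)blockIdx.x);
  }
}

int main() {
  hello<<<1, 64>>>();
  hipError_t err = hipDeviceSynchronize();  // device printf output is flushed here
  if (err != hipSuccess) {
    fprintf(stderr, "HIP error: %s\n", hipGetErrorString(err));
    return 1;
  }
  return 0;
}
EOF
}

# Hypothetical helper: compile the program with buffered (non-hostcall) printf.
build_printf_repro() {
  local src="$1" out="$2"
  command -v hipcc >/dev/null 2>&1 || { echo "hipcc not found in PATH" >&2; return 1; }
  hipcc -mprintf-kind=buffered "$src" -o "$out"
}

# Usage:
#   write_printf_repro printf_repro.cpp
#   build_printf_repro printf_repro.cpp printf_repro && ./printf_repro
```

If this standalone program prints correctly while the RCCL build does not, that would suggest the problem is specific to how RCCL is built or linked rather than to buffered printf in general.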

After recompiling RCCL, the test reported this error:

enquence.cc Current function: ncclLaunchKernel line 1090
:1:devhcprintf.cpp :265 : 81559524344 us: [pid:65800 tid:0x7f0d2c53d440] Hash value collision detected, printf buffer ill formed
:1:rocvirtual.cpp :3188: 81559524353 us: [pid:65800 tid:0x7f0d2c53d440]
Could not print data from the printf buffer!
:1:rocvirtual.cpp :3280: 81559524355 us: [pid:65800 tid:0x7f0d2c53d440] AQL dispatch failed!
:1:devhcprintf.cpp :265 : 81559524402 us: [pid:65799 tid:0x7ff8fd860440] Hash value collision detected, printf buffer ill formed
:1:rocvirtual.cpp :3188: 81559524410 us: [pid:65799 tid:0x7ff8fd860440]
Could not print data from the printf buffer!
:1:rocvirtual.cpp :3280: 81559524416 us: [pid:65799 tid:0x7ff8fd860440] AQL dispatch failed!
[rank0]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
[rank1]: RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)

I have set these environment variables:
export HIP_KERNEL_PRINTF=1
export HIP_ENABLE_PRINTF=1
export HCC_ENABLE_PRINTF=1
export AMD_LOG_LEVEL=1

I am using a Linux server with two GPUs. Without the printf, the program runs normally. How can I solve this problem?

Operating System

22.04.1 LTS (Jammy Jellyfish)

CPU

12th Gen Intel(R) Core(TM) i7-12700

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 5.7.0

ROCm Component

HIP, HIPCC, rccl

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@yangyangv8 yangyangv8 changed the title [Issue]: ROCM5.7.3, RCCL2.15.5 GPU kernel can't printf [Issue]: ROCM5.7.3, RCCL2.15.5 GPU kernel can't printf。Hash value collision detected, Apr 9, 2024
@yangyangv8 yangyangv8 changed the title [Issue]: ROCM5.7.3, RCCL2.15.5 GPU kernel can't printf。Hash value collision detected, [Issue]: ROCM5.7.3, RCCL2.15.5 GPU kernel can't printf。Hash value collision detected Apr 9, 2024
@mangupta
Contributor

mangupta commented Apr 9, 2024

@yangyangv8: Can you confirm that the test you are running, i.e. "./build/sendrecv_perf -b 8 -e 128M -f 2 -t 1 -g 2", runs fine when you rebuild rccl from source without adding the printf to the kernel?

@yangyangv8
Author

@mangupta I have confirmed that the program runs normally without adding printf in the kernel.

@yangyangv8
Author

@mangupta Hello, is there any update on this issue?

@yangyangv8 yangyangv8 changed the title [Issue]: ROCM5.7.3, RCCL2.15.5 GPU kernel can't printf。Hash value collision detected [Issue]: ROCM5.7.3, RCCL2.19.4 GPU kernel can't printf。Hash value collision detected Apr 26, 2024
@ppanchad-amd

Hi @yangyangv8, an internal ticket has been created to investigate your issue. Thanks!

@sohaibnd

Hi @yangyangv8, sorry for the delayed response.

I am closing this issue since it is a duplicate of github.com/ROCm/ROCm/issues/3001 and is being addressed there. Also, note that this is an issue directed to the rccl repo so should ideally be created there.

@sohaibnd sohaibnd added the duplicate This issue or pull request already exists label Sep 24, 2024
@sohaibnd sohaibnd closed this as not planned (won't fix, can't repro, duplicate, stale) Sep 24, 2024