rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000 #10087
Comments
@jinz2014 this is most likely a system setup / permission issue on your side, since UCX 1.15 has been used extensively with numerous applications on MI100. Can you please check the following things:
|
The answers are yes to both questions. Verified allreduce for size 0 (19.865 us per iteration) |
Could you please provide the full command line that you used? I see that the put_zcopy protocol is being used, which is not the default with 1.15; the default should be the get_zcopy protocol. |
Sorry, I am not familiar with the two protocols. "make run" shows the full command:
$HOME/ompi_for_gpu/ompi/bin/mpirun -n 2 ./main
Thank you for the instructions. |
So just for a test, could you change the command line to the following:
to see whether it makes a difference? |
Ok. $HOME/ompi_for_gpu/ompi/bin/mpirun -x UCX_RNDV_SCHEME=get_zcopy -n 2 ./main |
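For reference, the two invocations in this exchange differ only in the `-x UCX_RNDV_SCHEME=...` option, which exports the variable to the launched ranks. Below is a minimal sketch of a wrapper that makes the scheme switchable. The function name `mpirun_cmd` and the `MPIRUN` variable are hypothetical and not part of the test code; the function only prints the command so it can be inspected before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the mpirun command line for a given rendezvous scheme.
# MPIRUN defaults to the install location used in this issue (an assumption).
MPIRUN="${MPIRUN:-$HOME/ompi_for_gpu/ompi/bin/mpirun}"

mpirun_cmd() {
    local scheme="$1"   # e.g. get_zcopy or put_zcopy; empty keeps the UCX default
    local nprocs="$2"   # number of ranks
    local binary="$3"   # program to launch
    if [ -n "$scheme" ]; then
        echo "$MPIRUN -x UCX_RNDV_SCHEME=$scheme -n $nprocs $binary"
    else
        echo "$MPIRUN -n $nprocs $binary"
    fi
}

# Print both variants discussed above.
mpirun_cmd get_zcopy 2 ./main
mpirun_cmd "" 2 ./main
```

To actually launch the run rather than print it, the output could be passed to `eval`, e.g. `eval "$(mpirun_cmd get_zcopy 2 ./main)"`.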
Hm. Ok, I will see whether I can reproduce the issue locally. Are there instructions in the GitHub repo on how to compile the test code? |
export INSTALL_DIR=$HOME/ompi_for_gpu
export UCX_DIR=$INSTALL_DIR/ucx
export OMPI_DIR=$INSTALL_DIR/ompi
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip
will build and run the program. The CUDA example was migrated to HIP. I didn't observe errors when running the CUDA code, so I am not sure where the issue in the HIP example is. Thanks |
OK, but just to clarify, compiling the example is simply |
make run
The original CUDA code is https://github.com/baidu-research/baidu-allreduce |
I can confirm that I can reproduce the issue. In my case it is an MI250X system with ROCm 6.2 and UCX 1.16 (that is my default development platform at the moment), but the same error occurs. I will put it on my list of items to work on, but it might be towards the end of the week before I get to it. |
Okay. |
I think I know what the issue is, but I do not yet know whether it's something that we are doing wrong in the rocm components of UCX or whether it's a bug in the ROCm runtime layer. However, I have a quick workaround for your code (since a proper fix might take a while): if you allocate the output buffer outside of the RingAllreduce test and pass it in as an argument to RingAllreduce (e.g. allocate it right before the
Let me emphasize that your code is, however, correct, and it should work. |
I added another example
Thank you for the workaround. |
Describe the issue
[1724610589.249079] [cousteau:2779987:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7f9030c18000/8000
[1724610589.249092] [cousteau:2779986:0] rocm_ipc_md.c:79 UCX ERROR Failed to create ipc for 0x7fd7af610000/8000
[cousteau:2779987:0:2779987] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
[cousteau:2779986:0:2779986] rndv.c:1872 Assertion `sreq->send.rndv.lanes_count > 0' failed
Steps to Reproduce
export INSTALL_DIR=$HOME/ompi_for_gpu
export BUILD_DIR=/tmp/ompi_for_gpu_build
mkdir -p $BUILD_DIR
export UCX_DIR=$INSTALL_DIR/ucx
cd $BUILD_DIR
git clone https://github.com/openucx/ucx.git -b v1.15.x
cd ucx
./autogen.sh
mkdir build
cd build
../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make -j $(nproc) install
export OMPI_DIR=$INSTALL_DIR/ompi
cd $BUILD_DIR
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi
./autogen.pl
mkdir build
cd build
../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
make -j $(nproc)
make install
export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
export PATH=$OMPI_DIR/bin:$PATH
The example is in https://github.com/zjin-lcf/HeCBench/tree/master/src/allreduce-hip
make run
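Since a setup problem was one of the first suspicions in the comments, it can help to confirm that the UCX build from the steps above actually exposes ROCm transports. Below is a minimal sketch, assuming `ucx_info` from `$UCX_DIR/bin` is on the `PATH` (the steps above only add `$OMPI_DIR/bin`) and that `ucx_info -d` lists transport names such as `rocm_copy` and `rocm_ipc`. The helper name `list_rocm_transports` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ucx_info -d` style text on stdin and print
# the distinct ROCm transport names it mentions (e.g. rocm_copy, rocm_ipc).
list_rocm_transports() {
    grep -oE 'rocm_[a-z]+' | sort -u
}

# Only query UCX if the tool is reachable; otherwise print a hint.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d | list_rocm_transports
else
    echo "ucx_info not found; add \$UCX_DIR/bin to PATH" >&2
fi
```

An empty result would suggest that UCX was configured without `--with-rocm` or that a different UCX installation is being picked up.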
Setup and versions