Open
Problem Description
There are two issues encountered when using ROCm 6.0.2.
- The first one might be related to building a ROCm container on a machine lacking an AMD GPU. The build of ROCm used
amdgpu-install -y --usecase=hiplibsdk,rocm,hip,opencl to install, which in earlier versions defined __HIP_PLATFORM_AMD__, but it is no longer defined. As a result, configure fails with
checking for hip/hip_runtime.h... no
configure: error: unable to find required headers
This message is uninformative; a closer look at config.log shows
configure:4638: checking for hip/hip_runtime.h
configure:4638: gcc-12 -c -I/opt/rocm/include -I/opt/rocm/include -I/usr/include -I/usr/include conftest.c >&5
In file included from conftest.c:60:
/opt/rocm/include/hip/hip_runtime.h:66:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
66 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
| ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70:
/opt/rocm/include/hip/hip_runtime_api.h:8575:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
8575 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
| ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:71:
/opt/rocm/include/hip/library_types.h:75:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
75 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
| ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:73:
/opt/rocm/include/hip/hip_vector_types.h:38:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
38 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
| ^~~~~
Fixing this is just a matter of defining the macro as a compilation argument (e.g. -D__HIP_PLATFORM_AMD__), but previous versions did not require doing so explicitly.
- The second issue is a compilation failure. Because of changes made to hipPointerAttribute_t, the code no longer compiles and gives the message
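As a sketch of how the define could be supplied, the snippet below appends it to the configure invocation through CPPFLAGS. The option string is copied from the Docker recipe in this issue; that this project's configure script honors CPPFLAGS is an assumption based on standard autoconf behavior, and the variable names HIP_PLATFORM_DEFINE and CONFIGURE_CMD are hypothetical.

```shell
#!/bin/sh
# Options copied from the Docker recipe in this issue.
RCCL_CONFIGURE_OPTIONS="--prefix=/usr --with-mpi=/usr --with-libfabric=/usr --with-hip=/opt/rocm --with-rccl=/opt/rocm CC=gcc-12 CXX=g++-12"

# Assumption: configure forwards CPPFLAGS to the preprocessor (standard autoconf behavior).
HIP_PLATFORM_DEFINE="-D__HIP_PLATFORM_AMD__"
CONFIGURE_CMD="./configure ${RCCL_CONFIGURE_OPTIONS} CPPFLAGS=${HIP_PLATFORM_DEFINE}"

# Print the command instead of running it, so the sketch works outside the source tree.
echo "${CONFIGURE_CMD}"
```

In the recipe itself, the same effect could be had by adding CPPFLAGS=-D__HIP_PLATFORM_AMD__ to RCCL_CONFIGURE_OPTIONS.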
make[2]: Entering directory '/tmp/aws-ofi-rccl/src'
CC nccl_ofi_net.lo
nccl_ofi_net.c: In function 'get_cuda_device':
nccl_ofi_net.c:497:17: error: 'struct hipPointerAttribute_t' has no member named 'memoryType'
497 | if (attr.memoryType == hipMemoryTypeDevice) {
| ^
make[2]: *** [Makefile:435: nccl_ofi_net.lo] Error 1
The fix is to update this line to use attr.type.
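A minimal sketch of applying that rename from a build script follows. It runs the substitution on a temporary file holding a copy of the failing line, so it can be tried anywhere; the temporary file is a hypothetical stand-in for src/nccl_ofi_net.c.

```shell
#!/bin/sh
# Hypothetical stand-in for src/nccl_ofi_net.c, holding the line from the error message.
SRC_FILE="$(mktemp)"
printf '%s\n' 'if (attr.memoryType == hipMemoryTypeDevice) {' > "${SRC_FILE}"

# Rename the member that the compiler reports as missing.
sed 's/attr\.memoryType/attr.type/' "${SRC_FILE}" > "${SRC_FILE}.new"
mv "${SRC_FILE}.new" "${SRC_FILE}"

PATCHED_LINE="$(cat "${SRC_FILE}")"
echo "${PATCHED_LINE}"
rm -f "${SRC_FILE}"
```

Against the real tree, the same sed expression would target src/nccl_ofi_net.c. Note that an unconditional rename would likely break builds against older ROCm releases that still use memoryType, so a version guard in the C source may be the better long-term fix.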
Operating System
Ubuntu 22.04 LTS
CPU
AMD EPYC-Rome with no GPU
GPU
AMD Instinct MI250X
ROCm Version
ROCm 6.0.0
ROCm Component
No response
Steps to Reproduce
Here is the section of the Docker recipe showing the instructions that I am running.
ARG ROCM_VERSION=6.0.2
RUN echo "Building rocm ${ROCM_VERSION}" \
&& rocm_major=$(echo ${ROCM_VERSION} | sed "s/\./ /g" | awk '{print $1}') \
&& rocm_minor=$(echo ${ROCM_VERSION} | sed "s/\./ /g" | awk '{print $2}') \
&& ROCM_INSTALLER_VERSION=$(echo ${ROCM_VERSION} | sed "s/\./0/g") \
# if rocm version does not list minor patch version number add 00 to end of installer version
&& if [ $(echo ${ROCM_VERSION} | sed "s/\./\n/g" | wc -l) -eq "2" ]; then ROCM_INSTALLER_VERSION=${ROCM_INSTALLER_VERSION}"00"; fi \
&& ROCM_INSTALLER_VERSION=${ROCM_INSTALLER_VERSION}"-1" \
&& ROCM_INSTALLER_VERSION=${rocm_major}.${rocm_minor}.${ROCM_INSTALLER_VERSION} \
&& cd /tmp/build \
# && wget https://bootstrap.pypa.io/get-pip.py \
# && python3 get-pip.py \
&& roc_url="https://repo.radeon.com/amdgpu-install/"${ROCM_VERSION}"/ubuntu/jammy/amdgpu-install_"${ROCM_INSTALLER_VERSION}"_all.deb" \
&& echo ${roc_url} \
&& wget ${roc_url} \
&& apt -y install ./amdgpu-install_${ROCM_INSTALLER_VERSION}_all.deb \
&& amdgpu-install -y --usecase=hiplibsdk,rocm,hip,opencl \
&& cd /tmp/build && rm -rf amdgpu-install_${ROCM_INSTALLER_VERSION}_all.deb \
&& echo "Done"
# Install aws-ofi-rccl
ARG RCCL_CONFIGURE_OPTIONS="--prefix=/usr --with-mpi=/usr --with-libfabric=/usr --with-hip=/opt/rocm --with-rccl=/opt/rocm CC=gcc-12 CXX=g++-12"
RUN echo "Build rccl" \
&& git clone https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl.git \
&& cd aws-ofi-rccl \
&& ./autogen.sh \
&& ./configure ${RCCL_CONFIGURE_OPTIONS} \
&& make -j 16 \
&& make install \
&& cd /tmp \
&& rm -rf /tmp/build \
&& echo "Done"
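The installer version string computed in the first RUN step can be checked in isolation. The function below is a sketch that mirrors the recipe's logic (dots replaced by zeros, "00" appended for two-component versions, then a "-1" suffix); the name installer_version is hypothetical and not part of the recipe.

```shell
#!/bin/sh
# Derive the amdgpu-install package version the same way the recipe does.
installer_version() {
    version="$1"
    major="${version%%.*}"
    rest="${version#*.}"
    minor="${rest%%.*}"
    # Replace every dot with a zero, e.g. 6.0.2 -> 60002.
    packed="$(echo "${version}" | sed 's/\./0/g')"
    # A two-component version (e.g. 6.0) gets "00" appended, giving 60000.
    case "${version}" in
        *.*.*) ;;
        *) packed="${packed}00" ;;
    esac
    echo "${major}.${minor}.${packed}-1"
}

installer_version 6.0.2   # prints 6.0.60002-1
installer_version 6.0     # prints 6.0.60000-1
```

For ROCM_VERSION=6.0.2 this yields the package name amdgpu-install_6.0.60002-1_all.deb used in the download URL above.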
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response