[Issue]: Compatibility issue with rocm 6.0.2  #10

@pelahi

Description

Problem Description

There are two issues encountered when using ROCm 6.0.2.

  1. The first one might be related to building a ROCm container on a machine lacking an AMD GPU. ROCm was installed with amdgpu-install -y --usecase=hiplibsdk,rocm,hip,opencl, which in earlier versions left __HIP_PLATFORM_AMD__ defined, but it is no longer defined. As a result, configure fails with
checking for hip/hip_runtime.h... no
configure: error: unable to find required headers

This message is uninformative; a deeper look at config.log shows

configure:4638: checking for hip/hip_runtime.h
configure:4638: gcc-12 -c -I/opt/rocm/include -I/opt/rocm/include -I/usr/include  -I/usr/include  conftest.c >&5
In file included from conftest.c:60:
/opt/rocm/include/hip/hip_runtime.h:66:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
   66 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70:
/opt/rocm/include/hip/hip_runtime_api.h:8575:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
 8575 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:71:
/opt/rocm/include/hip/library_types.h:75:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
   75 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:73:
/opt/rocm/include/hip/hip_vector_types.h:38:2: error: #error ("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
   38 | #error("Must define exactly one of __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__");
      |  ^~~~~

It is just a matter of defining the macro on the compiler command line (for example -D__HIP_PLATFORM_AMD__), but previous versions did not require doing so explicitly.
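A minimal sketch of that workaround, assuming the aws-ofi-rccl configure script honours the standard autoconf CPPFLAGS variable (I have not confirmed this against every version). The helper name hip_configure_args is hypothetical; the remaining options are copied from RCCL_CONFIGURE_OPTIONS in the recipe under Steps to Reproduce.

```shell
# Assumption: configure honours the standard autoconf CPPFLAGS variable.
# hip_configure_args is a hypothetical helper, not part of aws-ofi-rccl.
hip_configure_args() {
    # Options copied from RCCL_CONFIGURE_OPTIONS in the Docker recipe.
    base="--prefix=/usr --with-mpi=/usr --with-libfabric=/usr --with-hip=/opt/rocm --with-rccl=/opt/rocm CC=gcc-12 CXX=g++-12"
    # Select the AMD backend explicitly so hip_runtime.h passes its #error guard.
    echo "${base} CPPFLAGS=-D__HIP_PLATFORM_AMD__"
}

# Print the resulting command line; in the recipe this would be run as
#   ./configure $(hip_configure_args)
echo "./configure $(hip_configure_args)"
```

Passing the define through CPPFLAGS rather than editing the headers keeps the ROCm install untouched and should also carry the macro into the later make step.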

  2. The other issue is a compilation error. Because of changes made to hipPointerAttribute_t, the code no longer compiles and gives the message

make[2]: Entering directory '/tmp/aws-ofi-rccl/src'
  CC       nccl_ofi_net.lo
nccl_ofi_net.c: In function 'get_cuda_device':
nccl_ofi_net.c:497:17: error: 'struct hipPointerAttribute_t' has no member named 'memoryType'
  497 |         if (attr.memoryType == hipMemoryTypeDevice) {
      |                 ^
make[2]: *** [Makefile:435: nccl_ofi_net.lo] Error 1

The fix is to update this line to use attr.type instead of attr.memoryType.
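Until the source is updated, the rename could be applied in the Docker recipe before running make. This is only a sketch: patch_memory_type is a hypothetical helper, and the path src/nccl_ofi_net.c is taken from the make output above. It should be applied only when building against ROCm 6.x, since I would expect older releases to still use memoryType.

```shell
# Hypothetical helper: rewrite attr.memoryType to attr.type in one source file.
# Assumption: every attr.memoryType in the file refers to hipPointerAttribute_t.
patch_memory_type() {
    src="$1"
    # -i.bak edits in place and works with both GNU and BSD sed.
    sed -i.bak 's/attr\.memoryType/attr.type/g' "$src"
    rm -f "${src}.bak"
}

# Usage inside the cloned aws-ofi-rccl tree (skipped if the file is absent):
if [ -f src/nccl_ofi_net.c ]; then
    patch_memory_type src/nccl_ofi_net.c
fi
```

A version guard in the C source would be the cleaner long-term fix; the sed rewrite is only meant to unblock a container build.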

Operating System

Ubuntu 22.04 LTS

CPU

AMD EPYC-Rome with no GPU

GPU

AMD Instinct MI250X

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

Here is the section of the Docker recipe showing the instructions that I am running.

ARG ROCM_VERSION=6.0.2
RUN echo "Building rocm ${ROCM_VERSION}" \
    && rocm_major=$(echo ${ROCM_VERSION} | sed "s/\./ /g" | awk '{print $1}') \
    && rocm_minor=$(echo ${ROCM_VERSION} | sed "s/\./ /g" | awk '{print $2}') \
    && ROCM_INSTALLER_VERSION=$(echo ${ROCM_VERSION} | sed "s/\./0/g") \
    # if rocm version does not list minor patch version number add 00 to end of installer version
    && if [ $(echo ${ROCM_VERSION} | sed "s/\./\n/g" | wc -l) -eq "2" ]; then ROCM_INSTALLER_VERSION=${ROCM_INSTALLER_VERSION}"00"; fi \
    && ROCM_INSTALLER_VERSION=${ROCM_INSTALLER_VERSION}"-1" \
    && ROCM_INSTALLER_VERSION=${rocm_major}.${rocm_minor}.${ROCM_INSTALLER_VERSION} \
	&& cd /tmp/build \
    # && wget https://bootstrap.pypa.io/get-pip.py \
    # && python3 get-pip.py \
    && roc_url="https://repo.radeon.com/amdgpu-install/"${ROCM_VERSION}"/ubuntu/jammy/amdgpu-install_"${ROCM_INSTALLER_VERSION}"_all.deb" \
    && echo ${roc_url} \
	&& wget ${roc_url} \
	&& apt -y install ./amdgpu-install_${ROCM_INSTALLER_VERSION}_all.deb \
	&& amdgpu-install -y --usecase=hiplibsdk,rocm,hip,opencl \
    && cd /tmp/build && rm -rf amdgpu-install_${ROCM_INSTALLER_VERSION}_all.deb \
	&& echo "Done"

# Install aws-ofi-rccl
ARG RCCL_CONFIGURE_OPTIONS="--prefix=/usr --with-mpi=/usr --with-libfabric=/usr --with-hip=/opt/rocm --with-rccl=/opt/rocm CC=gcc-12 CXX=g++-12"
RUN echo "Build rccl" \
    && git clone https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl.git \
	&& cd aws-ofi-rccl \
	&& ./autogen.sh \
	&& ./configure ${RCCL_CONFIGURE_OPTIONS} \
	&& make -j 16 \
	&& make install \
        && cd /tmp \
	&& rm -rf /tmp/build \
	&& echo "Done"
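For anyone adapting the recipe, the installer version string it builds can be checked separately. The sketch below mirrors the sed and awk steps above in one function. The name rocm_installer_version is hypothetical, and the naming scheme is inferred from the recipe itself, not from AMD documentation.

```shell
# Hypothetical helper mirroring the recipe's installer-version arithmetic.
# Assumption: the .deb is named <major>.<minor>.<packed>-1, where <packed> is
# the ROCm version with each dot replaced by 0, plus "00" if no patch number.
rocm_installer_version() {
    version="$1"
    major="$(echo "$version" | cut -d. -f1)"
    minor="$(echo "$version" | cut -d. -f2)"
    patch="$(echo "$version" | cut -d. -f3)"
    packed="$(echo "$version" | sed 's/\./0/g')"
    # A two-part version such as 6.0 has no patch field, so pad with 00.
    if [ -z "$patch" ]; then
        packed="${packed}00"
    fi
    echo "${major}.${minor}.${packed}-1"
}

rocm_installer_version 6.0.2
```

The printed value is what the recipe substitutes into amdgpu-install_${ROCM_INSTALLER_VERSION}_all.deb.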

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
