
CMake failing for nvidiagpu on Perlmutter-GPU #55

Closed
xylar opened this issue Jan 29, 2024 · 24 comments
Labels: bug (Something isn't working), CMake (CMake-related issues)

Comments


xylar commented Jan 29, 2024

When I run:

module load cmake

mkdir -p build_omega/build_pm-gpu_nvidiagpu
cd build_omega/build_pm-gpu_nvidiagpu

export METIS_ROOT=/pscratch/sd/x/xylar/spack_pm-gpu_test//dev_polaris_0_3_0_nvidiagpu_mpich/var/spack/environments/dev_polaris_0_3_0_nvidiagpu_mpich/.spack-env/view
export PARMETIS_ROOT=/pscratch/sd/x/xylar/spack_pm-gpu_test//dev_polaris_0_3_0_nvidiagpu_mpich/var/spack/environments/dev_polaris_0_3_0_nvidiagpu_mpich/.spack-env/view

cmake \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_COMPILER=nvidiagpu \
   -DOMEGA_CIME_MACHINE=pm-gpu \
   -DOMEGA_METIS_ROOT=${METIS_ROOT} \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
   -DOMEGA_BUILD_TEST=ON \
   -S /global/u2/x/xylar/e3sm_work/polaris/add-omega-ctest-util/e3sm_submodules/Omega/components/omega/ \
   -B . 

I'm seeing:

-- Cray Programming Environment 2.7.20 Fortran
CMake Error at /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:726 (message):
  Compiling the CUDA compiler identification source file
  "CMakeCUDACompilerId.cu" failed.

  Compiler:
  /global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/nvcc


  Build flags:

  Id flags: --keep;--keep-dir;tmp;-ccbin=/opt/cray/pe/craype/2.7.20/bin/CC -v

  

  The output was:

  2

  #$ _NVVM_BRANCH_=nvvm

  #$ _SPACE_=

  #$ _CUDART_=cudart

  #$
  _HERE_=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin


  #$
  _THERE_=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin


  #$ _TARGET_SIZE_=

  #$ _TARGET_DIR_=

  #$ _TARGET_DIR_=targets/x86_64-linux

  #$
  TOP=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/..


  #$
  NVVMIR_LIBRARY_DIR=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../nvvm/libdevice


  #$
  LD_LIBRARY_PATH=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../lib:/opt/cray/pe/mpich/8.1.25/ofi/nvidia/20.7/lib:/opt/cray/pe/mpich/8.1.25/gtl/lib:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/math_libs/11.5/lib64:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/extras/CUPTI/lib64:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/extras/Debugger/lib64:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/nvvm/lib64:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/math_libs/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib:/opt/cray/libfabric/1.15.2.0/lib64


  #$
  PATH=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../nvvm/bin:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin:/global/common/software/nersc/pm-2022q4/spack/linux-sles15-zen/cmake-3.24.3-k5msymx/bin:/opt/cray/pe/parallel-netcdf/1.12.3.3/bin:/opt/cray/pe/netcdf-hdf5parallel/4.9.0.3/bin:/opt/cray/pe/hdf5-parallel/1.12.2.3/bin:/opt/cray/pe/hdf5/1.12.2.3/bin:/opt/cray/pe/mpich/8.1.25/ofi/nvidia/20.7/bin:/opt/cray/pe/mpich/8.1.25/bin:/opt/cray/pe/craype/2.7.20/bin:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/compute-sanitizer:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/libnvvp:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/profilers/Nsight_Compute:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/profilers/Nsight_Systems/bin:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/extras/qd/bin:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/bin:/opt/nersc/pe/bin:/global/common/software/nersc/bin:/opt/cray/libfabric/1.15.2.0/bin:/usr/local/bin:/usr/bin:/bin:/usr/lib/mit/bin:/opt/cray/pe/bin


  #$
  INCLUDES="-I/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/include"


  #$ LIBRARIES=
  "-L/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/lib/stubs"
  "-L/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/lib"


  #$ CUDAFE_FLAGS=

  #$ PTXAS_FLAGS=

  #$ rm tmp/a_dlink.reg.c

  #$ "/opt/cray/pe/craype/2.7.20/bin"/CC -D__CUDA_ARCH__=520
  -D__CUDA_ARCH_LIST__=520 -E --nvcchost -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS
  -D__CUDACC__ -D__NVCC__
  "-I/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/include"
  -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5
  -D__CUDACC_VER_BUILD__=119 -D__CUDA_API_VER_MAJOR__=11
  -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 --preinclude
  "cuda_runtime.h" -m64 "CMakeCUDACompilerId.cu" -o
  "tmp/CMakeCUDACompilerId.cpp1.ii"

  
  "/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/include/crt/host_config.h",
  line 118: catastrophic error: #error directive: -- unsupported pgc++
  configuration! Only pgc++ 18, 19, 20 and 21 are supported! The nvcc flag
  '-allow-unsupported-compiler' can be used to override this version check;
  however, using an unsupported host compiler may cause compilation failure
  or incorrect run time execution.  Use at your own risk.

    #error -- unsupported pgc++ configuration! Only pgc++ 18, 19, 20 and 21 are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
     ^

  

  1 catastrophic error detected in the compilation of
  "CMakeCUDACompilerId.cu".

  Compilation terminated.

  # --error 0x2 --
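
(For reference, the error text above points at nvcc's -allow-unsupported-compiler escape hatch. One untested way to try forwarding it so that it also reaches CMake's compiler-identification step is the CUDAFLAGS environment variable, which CMake uses to seed CMAKE_CUDA_FLAGS:)

# Untested workaround hinted at by the nvcc error above: forward the flag
# via CUDAFLAGS before re-running the cmake command, so it shows up in the
# "Build flags" used during compiler identification.
export CUDAFLAGS="-allow-unsupported-compiler"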
xylar added the bug and CMake labels Jan 29, 2024

xylar commented Jan 29, 2024

@grnydawn, same deal as in #54. Again, I know it's a lot. Since this one affects a GPU compiler we want to support, it's a bit higher priority than #54 and #52.

And, again, it's entirely possible I've made a mistake.

grnydawn commented:

@xylar I also got the same issue on Perlmutter. I think I need to understand more about the impact of using the compiler configurations from CIME.


xylar commented Jan 29, 2024

@grnydawn, no problem. I'm a total amateur at CMake so I really appreciate what you've done so far. We're all learning the process here.

grnydawn commented:

@xylar, this issue does not show up when I use a newer cudatoolkit module (12.2) than the one (11.5) in config_machines.xml. However, after getting past this issue, the same problem as in #54 occurs with the ekat/yaml compilation.
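
The swap would look roughly like this before configuring (the exact module name and version on Perlmutter may differ; check "module avail cudatoolkit" for what is installed):

# Rough sketch of the module swap described above.
module unload cudatoolkit
module load cudatoolkit/12.2
module load cmake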


xylar commented Jul 7, 2024

@grnydawn, it looks like E3SM might be forced to move to a new cudatoolkit on Perlmutter-GPU soon; see https://acmeclimate.slack.com/archives/C021DPJEL9X/p1720035170508459. Let's keep an eye on it and check back on this issue if that happens.


xylar commented Jul 10, 2024

With #97, I'm able to build CTests but I'm seeing:

no CUDA-capable device is detected

over and over, specifically:

0: terminate called after throwing an instance of 'std::runtime_error'
0:   what():  cudaGetDeviceCount(&m_cudaDevCount) error( cudaErrorNoDevice): no CUDA-capable device is detected /global/u2/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/develop/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:275


grnydawn commented Jul 10, 2024

@xylar, I tried to reproduce this issue on pm-gpu, but I couldn't. I was able to build and run most, but not all, of the test cases without a problem on pm-gpu using the nvidiagpu compiler. One thing I want to check is whether the Omega build/run scripts can use a relative path to source a file created in the E3SM case. The omega_build.sh and omega_ctest.sh scripts source "./e3smcase/.env_mach_specific.sh" before they execute. It seems that the sourcing was not completed.
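
A quick way to verify that from the build directory is to check that the file exists and sources cleanly (the path is the one the scripts use):

# Sanity check, run from the Omega build directory: does the E3SM case
# environment file that omega_build.sh / omega_ctest.sh source actually
# exist, and does it source without errors?
ls -l ./e3smcase/.env_mach_specific.sh
source ./e3smcase/.env_mach_specific.sh && echo ".env_mach_specific.sh sourced OK"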


xylar commented Jul 10, 2024

Okay, I will try again and document my difficulties in more detail.


xylar commented Jul 10, 2024

@grnydawn, are you able to build with develop? I wasn't even able to do that.

grnydawn commented:

@xylar Ah, I ran my test using Phil's omega/sync-e3sm, not develop.


xylar commented Jul 10, 2024

Okay, that's what I will continue doing as well.


xylar commented Jul 10, 2024

I was asking because I can't even build with develop.

grnydawn commented:

I am going to build using the develop branch now.


xylar commented Jul 10, 2024

Building seems to work fine (with sync-e3sm). Here's what I'm running:

#!/usr/bin/env bash

cwd=${PWD}

module load cmake
# quit on errors
set -e
# trace commands
set -x

cd /global/u2/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/sync-e3sm

git submodule update --init --recursive externals/YAKL externals/ekat \
    externals/scorpio cime

cd ${cwd}

rm -rf build_omega/build_pm-gpu_nvidiagpu

mkdir -p build_omega/build_pm-gpu_nvidiagpu
cd build_omega/build_pm-gpu_nvidiagpu

export METIS_ROOT=/global/cfs/cdirs/e3sm/software/polaris/pm-gpu/spack/dev_polaris_0_3_0_nvidiagpu_mpich/var/spack/environments/dev_polaris_0_3_0_nvidiagpu_mpich/.spack-env/view
export PARMETIS_ROOT=/global/cfs/cdirs/e3sm/software/polaris/pm-gpu/spack/dev_polaris_0_3_0_nvidiagpu_mpich/var/spack/environments/dev_polaris_0_3_0_nvidiagpu_mpich/.spack-env/view

cmake \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_COMPILER=nvidiagpu \
   -DOMEGA_CIME_MACHINE=pm-gpu \
   -DOMEGA_METIS_ROOT=${METIS_ROOT} \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S /global/u2/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/sync-e3sm/components/omega \
   -B . 

./omega_build.sh

cd test

ln -sfn /global/cfs/cdirs/e3sm/polaris/ocean/omega_ctest/ocean.QU.240km.151209.nc OmegaMesh.nc
ln -sfn /global/cfs/cdirs/e3sm/polaris/ocean/omega_ctest/PlanarPeriodic48x48.nc OmegaPlanarMesh.nc
ln -sfn /global/cfs/cdirs/e3sm/polaris/ocean/omega_ctest/cosine_bell_icos480_initial_state.230220.nc OmegaSphereMesh.nc


xylar commented Jul 10, 2024

My job script looks like this:

#!/bin/bash
#SBATCH  --job-name=omega_ctest_pm-gpu_nvidiagpu
#SBATCH  --account=e3sm
#SBATCH  --nodes=1
#SBATCH  --output=omega_ctest_pm-gpu_nvidiagpu.o%j
#SBATCH  --exclusive
#SBATCH  --time=0:15:00
#SBATCH  --qos=debug
#SBATCH  --constraint=gpu

cd /global/u2/x/xylar/e3sm_work/polaris/main/build_omega/build_pm-gpu_nvidiagpu
./omega_ctest.sh

@grnydawn, do you see any issues there?

grnydawn commented:

@xylar Could you add "-DOMEGA_ARCH=CUDA" to the cmake command line? The Omega build system may detect that it is a CUDA build, but I have to check whether that is right or not.
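
Concretely, applied to the build script above, the configure step would gain just that one flag (everything else unchanged):

cmake \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_COMPILER=nvidiagpu \
   -DOMEGA_CIME_MACHINE=pm-gpu \
   -DOMEGA_ARCH=CUDA \
   -DOMEGA_METIS_ROOT=${METIS_ROOT} \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S /global/u2/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/sync-e3sm/components/omega \
   -B .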


xylar commented Jul 10, 2024

I can add that manually now. How do I know, in general, what the correct OMEGA_ARCH is for a given compiler on a given machine?


grnydawn commented Jul 10, 2024

I think that knowing the compiler and machine alone does not determine the user's intended build target architecture. However, if OMEGA_ARCH is not specified, the Omega build system checks whether nvcc or hipcc is available on the system and tries to use one of them according to a certain compiler priority. But, in general, the Omega build system does not know the user's intended target architecture.
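
Roughly, the fallback behaves like the sketch below (illustrative only, not the actual Omega build-system code; the non-GPU default name is assumed):

# Illustrative sketch of the fallback described above -- not the actual
# Omega build-system logic. If OMEGA_ARCH is unset, pick whichever device
# compiler is visible on the system, preferring CUDA over HIP.
if [ -z "${OMEGA_ARCH}" ]; then
    if command -v nvcc >/dev/null 2>&1; then
        OMEGA_ARCH=CUDA
    elif command -v hipcc >/dev/null 2>&1; then
        OMEGA_ARCH=HIP
    else
        OMEGA_ARCH=""   # host-only build; the exact default name is not shown in this thread
    fi
fi
echo "OMEGA_ARCH=${OMEGA_ARCH}"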


xylar commented Jul 10, 2024

Okay, so is a user supposed to know their OMEGA_ARCH in addition to their machine and compiler? It seems like only -DOMEGA_ARCH=CUDA works for nvidiagpu, and my guess is the same is true for gnugpu on pm-gpu. At least under Polaris, it would be nice if Omega users didn't need to know what OMEGA_ARCH to use (or at least if there were a sensible default for every compiler). So what I'm asking, then, is how do I determine a sensible default for OMEGA_ARCH for all compilers and machines?


xylar commented Jul 10, 2024

Also, it doesn't seem like detection of nvcc worked successfully since CUDA does not seem to have been detected. Maybe that's because it's not actually loaded into the environment I'm building from?
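
A simple way to check that from the shell used for the build (nothing Omega-specific, just looking for nvcc and a loaded cudatoolkit module):

# Is nvcc visible in the environment the build runs in?
command -v nvcc && nvcc --version || echo "nvcc not found in PATH"
# Is a cudatoolkit module loaded? (module list prints to stderr)
module list 2>&1 | grep -i cudatoolkit || echo "no cudatoolkit module loaded"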


xylar commented Jul 10, 2024

Unfortunately, rebuilding with -DOMEGA_ARCH=CUDA is producing the same errors for me as before:

2:   what():  (CudaInternal::singleton().cuda_get_device_count_wrapper<false>( &m_cudaDevCount)) error( cudaErrorNoDevice): no CUDA-capable device is detected /global/homes/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/sync-e3sm/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:280


xylar commented Jul 10, 2024

I'm going to move the conversation to Slack.


xylar commented Jul 10, 2024

Okay, this worked for me when I added --gpus=4 to my job script. How embarrassing!!!

@grnydawn, I think we can close this issue.
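
For the record, the working job script is just the one above with that single line added:

#!/bin/bash
#SBATCH  --job-name=omega_ctest_pm-gpu_nvidiagpu
#SBATCH  --account=e3sm
#SBATCH  --nodes=1
#SBATCH  --output=omega_ctest_pm-gpu_nvidiagpu.o%j
#SBATCH  --exclusive
#SBATCH  --time=0:15:00
#SBATCH  --qos=debug
#SBATCH  --constraint=gpu
#SBATCH  --gpus=4

cd /global/u2/x/xylar/e3sm_work/polaris/main/build_omega/build_pm-gpu_nvidiagpu
./omega_ctest.sh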

grnydawn commented:

@xylar, I am good to close this issue. Thanks for the work! BTW, we may continue to discuss the usage of OMEGA_ARCH somewhere else.

xylar closed this as completed Jul 10, 2024