
CMake failing for nvidiagpu on Perlmutter-GPU #55

Closed
xylar opened this issue Jan 29, 2024 · 24 comments
Labels: bug (Something isn't working), CMake (CMake-related issues)

Comments


xylar commented Jan 29, 2024

When I run:

module load cmake

mkdir -p build_omega/build_pm-gpu_nvidiagpu
cd build_omega/build_pm-gpu_nvidiagpu

export METIS_ROOT=/pscratch/sd/x/xylar/spack_pm-gpu_test//dev_polaris_0_3_0_nvidiagpu_mpich/var/spack/environments/dev_polaris_0_3_0_nvidiagpu_mpich/.spack-env/view
export PARMETIS_ROOT=/pscratch/sd/x/xylar/spack_pm-gpu_test//dev_polaris_0_3_0_nvidiagpu_mpich/var/spack/environments/dev_polaris_0_3_0_nvidiagpu_mpich/.spack-env/view

cmake \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_COMPILER=nvidiagpu \
   -DOMEGA_CIME_MACHINE=pm-gpu \
   -DOMEGA_METIS_ROOT=${METIS_ROOT} \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
   -DOMEGA_BUILD_TEST=ON \
   -S /global/u2/x/xylar/e3sm_work/polaris/add-omega-ctest-util/e3sm_submodules/Omega/components/omega/ \
   -B . 

I'm seeing:

-- Cray Programming Environment 2.7.20 Fortran
CMake Error at /global/common/software/nersc/pm-2021q4/sw/cmake-3.22.0/share/cmake-3.22/Modules/CMakeDetermineCompilerId.cmake:726 (message):
  Compiling the CUDA compiler identification source file
  "CMakeCUDACompilerId.cu" failed.

  Compiler:
  /global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/nvcc


  Build flags:

  Id flags: --keep;--keep-dir;tmp;-ccbin=/opt/cray/pe/craype/2.7.20/bin/CC -v

  

  The output was:

  2

  #$ _NVVM_BRANCH_=nvvm

  #$ _SPACE_=

  #$ _CUDART_=cudart

  #$
  _HERE_=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin


  #$
  _THERE_=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin


  #$ _TARGET_SIZE_=

  #$ _TARGET_DIR_=

  #$ _TARGET_DIR_=targets/x86_64-linux

  #$
  TOP=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/..


  #$
  NVVMIR_LIBRARY_DIR=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../nvvm/libdevice


  #$
  LD_LIBRARY_PATH=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../lib:/opt/cray/pe/mpich/8.1.25/ofi/nvidia/20.7/lib:/opt/cray/pe/mpich/8.1.25/gtl/lib:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/math_libs/11.5/lib64:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/extras/CUPTI/lib64:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/extras/Debugger/lib64:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/nvvm/lib64:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/math_libs/lib64:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib:/opt/cray/libfabric/1.15.2.0/lib64


  #$
  PATH=/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../nvvm/bin:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin:/global/common/software/nersc/pm-2022q4/spack/linux-sles15-zen/cmake-3.24.3-k5msymx/bin:/opt/cray/pe/parallel-netcdf/1.12.3.3/bin:/opt/cray/pe/netcdf-hdf5parallel/4.9.0.3/bin:/opt/cray/pe/hdf5-parallel/1.12.2.3/bin:/opt/cray/pe/hdf5/1.12.2.3/bin:/opt/cray/pe/mpich/8.1.25/ofi/nvidia/20.7/bin:/opt/cray/pe/mpich/8.1.25/bin:/opt/cray/pe/craype/2.7.20/bin:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/compute-sanitizer:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/libnvvp:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/profilers/Nsight_Compute:/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/profilers/Nsight_Systems/bin:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/extras/qd/bin:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/bin:/opt/nersc/pe/bin:/global/common/software/nersc/bin:/opt/cray/libfabric/1.15.2.0/bin:/usr/local/bin:/usr/bin:/bin:/usr/lib/mit/bin:/opt/cray/pe/bin


  #$
  INCLUDES="-I/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/include"


  #$ LIBRARIES=
  "-L/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/lib/stubs"
  "-L/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/lib"


  #$ CUDAFE_FLAGS=

  #$ PTXAS_FLAGS=

  #$ rm tmp/a_dlink.reg.c

  #$ "/opt/cray/pe/craype/2.7.20/bin"/CC -D__CUDA_ARCH__=520
  -D__CUDA_ARCH_LIST__=520 -E --nvcchost -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS
  -D__CUDACC__ -D__NVCC__
  "-I/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/include"
  -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=5
  -D__CUDACC_VER_BUILD__=119 -D__CUDA_API_VER_MAJOR__=11
  -D__CUDA_API_VER_MINOR__=5 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 --preinclude
  "cuda_runtime.h" -m64 "CMakeCUDACompilerId.cu" -o
  "tmp/CMakeCUDACompilerId.cpp1.ii"

  
  "/global/common/software/nersc9/nvidia_hpc_sdk/Linux_x86_64/21.11/cuda/11.5/bin/../targets/x86_64-linux/include/crt/host_config.h",
  line 118: catastrophic error: #error directive: -- unsupported pgc++
  configuration! Only pgc++ 18, 19, 20 and 21 are supported! The nvcc flag
  '-allow-unsupported-compiler' can be used to override this version check;
  however, using an unsupported host compiler may cause compilation failure
  or incorrect run time execution.  Use at your own risk.

    #error -- unsupported pgc++ configuration! Only pgc++ 18, 19, 20 and 21 are supported! The nvcc flag '-allow-unsupported-compiler' can be used to override this version check; however, using an unsupported host compiler may cause compilation failure or incorrect run time execution. Use at your own risk.
     ^

  

  1 catastrophic error detected in the compilation of
  "CMakeCUDACompilerId.cu".

  Compilation terminated.

  # --error 0x2 --
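
(For reference, the error text above points at nvcc's -allow-unsupported-compiler escape hatch. One untested way to try forwarding it so that it also reaches CMake's compiler-identification step is the CUDAFLAGS environment variable, which CMake uses to seed CMAKE_CUDA_FLAGS:)

# Untested workaround hinted at by the nvcc error above: forward the flag
# via CUDAFLAGS before re-running the cmake command, so it shows up in the
# "Build flags" used during compiler identification.
export CUDAFLAGS="-allow-unsupported-compiler"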
xylar added the bug and CMake labels Jan 29, 2024

xylar commented Jan 29, 2024

@grnydawn, same deal as in #54. Again, I know it's a lot. Since this one affects a GPU compiler we want to support, it's a bit higher priority than #54 and #52.

And, again, it's entirely possible I've made a mistake.

grnydawn commented:

@xylar I also got the same issue on Perlmutter. I think I need to understand more about the impact of using the compiler configurations from CIME.


xylar commented Jan 29, 2024

@grnydawn, no problem. I'm a total amateur at CMake so I really appreciate what you've done so far. We're all learning the process here.

grnydawn commented:

@xylar, this issue does not show up when I use a newer cudatoolkit module (12.2) than the one (11.5) in config_machines.xml. However, after getting past this issue, the same problem as in #54 occurs with the ekat/yaml compilation.
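
The swap would look roughly like this before configuring (the exact module name and version on Perlmutter may differ; check "module avail cudatoolkit" for what is installed):

# Rough sketch of the module swap described above.
module unload cudatoolkit
module load cudatoolkit/12.2
module load cmake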


xylar commented Jul 7, 2024

@grnydawn, it looks like E3SM might be forced to move to a new cudatoolkit on Perlmutter-GPU soon; see https://acmeclimate.slack.com/archives/C021DPJEL9X/p1720035170508459. Let's keep an eye on it and check back on this issue if that happens.


xylar commented Jul 10, 2024

With #97, I'm able to build CTests but I'm seeing:

no CUDA-capable device is detected

over and over, specifically:

0: terminate called after throwing an instance of 'std::runtime_error'
0:   what():  cudaGetDeviceCount(&m_cudaDevCount) error( cudaErrorNoDevice): no CUDA-capable device is detected /global/u2/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/develop/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:275


grnydawn commented Jul 10, 2024

@xylar, I tried to reproduce this issue on pm-gpu, but I couldn't. I was able to build and run most, but not all, of the test cases without a problem on pm-gpu using the nvidiagpu compiler. One thing I want to check is whether the Omega build/run scripts can use a relative path to source a file created in the E3SM case. The omega_build.sh and omega_ctest.sh scripts source "./e3smcase/.env_mach_specific.sh" before they execute. It seems that the sourcing was not completed.
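
A quick way to verify that from the build directory is to check that the file exists and sources cleanly (the path is the one the scripts use):

# Sanity check, run from the Omega build directory: does the E3SM case
# environment file that omega_build.sh / omega_ctest.sh source actually
# exist, and does it source without errors?
ls -l ./e3smcase/.env_mach_specific.sh
source ./e3smcase/.env_mach_specific.sh && echo ".env_mach_specific.sh sourced OK"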


xylar commented Jul 10, 2024

Okay, I will try again and document my difficulties in more detail.


xylar commented Jul 10, 2024

@grnydawn, are you able to build with develop? I wasn't even able to do that.

grnydawn commented:

@xylar Ah, I ran my test using Phil's omega/sync-e3sm, not develop.


xylar commented Jul 10, 2024

Okay, that's what I will continue doing as well.


xylar commented Jul 10, 2024

I was asking because I can't even build with develop.

grnydawn commented:

I am going to build using the develop branch now.


xylar commented Jul 10, 2024

Building seems to work fine (with sync-e3sm). Here's what I'm running:

#!/usr/bin/env bash

cwd=${PWD}

module load cmake
# quit on errors
set -e
# trace commands
set -x

cd /global/u2/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/sync-e3sm

git submodule update --init --recursive externals/YAKL externals/ekat \
    externals/scorpio cime

cd ${cwd}

rm -rf build_omega/build_pm-gpu_nvidiagpu

mkdir -p build_omega/build_pm-gpu_nvidiagpu
cd build_omega/build_pm-gpu_nvidiagpu

export METIS_ROOT=/global/cfs/cdirs/e3sm/software/polaris/pm-gpu/spack/dev_polaris_0_3_0_nvidiagpu_mpich/var/spack/environments/dev_polaris_0_3_0_nvidiagpu_mpich/.spack-env/view
export PARMETIS_ROOT=/global/cfs/cdirs/e3sm/software/polaris/pm-gpu/spack/dev_polaris_0_3_0_nvidiagpu_mpich/var/spack/environments/dev_polaris_0_3_0_nvidiagpu_mpich/.spack-env/view

cmake \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_COMPILER=nvidiagpu \
   -DOMEGA_CIME_MACHINE=pm-gpu \
   -DOMEGA_METIS_ROOT=${METIS_ROOT} \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S /global/u2/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/sync-e3sm/components/omega \
   -B . 

./omega_build.sh

cd test

ln -sfn /global/cfs/cdirs/e3sm/polaris/ocean/omega_ctest/ocean.QU.240km.151209.nc OmegaMesh.nc
ln -sfn /global/cfs/cdirs/e3sm/polaris/ocean/omega_ctest/PlanarPeriodic48x48.nc OmegaPlanarMesh.nc
ln -sfn /global/cfs/cdirs/e3sm/polaris/ocean/omega_ctest/cosine_bell_icos480_initial_state.230220.nc OmegaSphereMesh.nc


xylar commented Jul 10, 2024

My job script looks like this:

#!/bin/bash
#SBATCH  --job-name=omega_ctest_pm-gpu_nvidiagpu
#SBATCH  --account=e3sm
#SBATCH  --nodes=1
#SBATCH  --output=omega_ctest_pm-gpu_nvidiagpu.o%j
#SBATCH  --exclusive
#SBATCH  --time=0:15:00
#SBATCH  --qos=debug
#SBATCH  --constraint=gpu

cd /global/u2/x/xylar/e3sm_work/polaris/main/build_omega/build_pm-gpu_nvidiagpu
./omega_ctest.sh

@grnydawn, do you see any issues there?

grnydawn commented:

@xylar Could you add "-DOMEGA_ARCH=CUDA" to the cmake command line? The Omega build system may detect that it is a CUDA build, but I have to check whether that is right or not.
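
Concretely, applied to the build script above, the configure step would gain just that one flag (everything else unchanged):

cmake \
   -DOMEGA_BUILD_TYPE=Release \
   -DOMEGA_CIME_COMPILER=nvidiagpu \
   -DOMEGA_CIME_MACHINE=pm-gpu \
   -DOMEGA_ARCH=CUDA \
   -DOMEGA_METIS_ROOT=${METIS_ROOT} \
   -DOMEGA_PARMETIS_ROOT=${PARMETIS_ROOT} \
   -DOMEGA_BUILD_TEST=ON \
   -Wno-dev \
   -S /global/u2/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/sync-e3sm/components/omega \
   -B .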


xylar commented Jul 10, 2024

I can add that manually now. How do I know, in general, what the correct OMEGA_ARCH is for a given compiler on a given machine?


grnydawn commented Jul 10, 2024

I think that knowing the compiler and machine alone does not determine the user's intended build target architecture. However, if OMEGA_ARCH is not specified, the Omega build system checks whether nvcc or hipcc is available on the system and tries to use one of them according to a certain compiler priority. But, in general, the Omega build system does not know the user's intended target architecture.
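
Roughly, the fallback behaves like the sketch below (illustrative only, not the actual Omega build-system code; the non-GPU default name is assumed):

# Illustrative sketch of the fallback described above -- not the actual
# Omega build-system logic. If OMEGA_ARCH is unset, pick whichever device
# compiler is visible on the system, preferring CUDA over HIP.
if [ -z "${OMEGA_ARCH}" ]; then
    if command -v nvcc >/dev/null 2>&1; then
        OMEGA_ARCH=CUDA
    elif command -v hipcc >/dev/null 2>&1; then
        OMEGA_ARCH=HIP
    else
        OMEGA_ARCH=""   # host-only build; the exact default name is not shown in this thread
    fi
fi
echo "OMEGA_ARCH=${OMEGA_ARCH}"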


xylar commented Jul 10, 2024

Okay, so is a user supposed to know their OMEGA_ARCH in addition to their machine and compiler? It seems like only -DOMEGA_ARCH=CUDA works for nvidiagpu, and my guess is the same is true for gnugpu on pm-gpu. At least under Polaris, it would be nice if Omega users didn't need to know what OMEGA_ARCH to use (or at least if there were a sensible default for every compiler). So what I'm asking, then, is how do I determine a sensible default for OMEGA_ARCH for all compilers and machines?


xylar commented Jul 10, 2024

Also, it doesn't seem like detection of nvcc worked successfully since CUDA does not seem to have been detected. Maybe that's because it's not actually loaded into the environment I'm building from?
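
A simple way to check that from the shell used for the build (nothing Omega-specific, just looking for nvcc and a loaded cudatoolkit module):

# Is nvcc visible in the environment the build runs in?
command -v nvcc && nvcc --version || echo "nvcc not found in PATH"
# Is a cudatoolkit module loaded? (module list prints to stderr)
module list 2>&1 | grep -i cudatoolkit || echo "no cudatoolkit module loaded"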


xylar commented Jul 10, 2024

Unfortunately, rebuilding with -DOMEGA_ARCH=CUDA is producing the same errors for me as before:

2:   what():  (CudaInternal::singleton().cuda_get_device_count_wrapper<false>( &m_cudaDevCount)) error( cudaErrorNoDevice): no CUDA-capable device is detected /global/homes/x/xylar/e3sm_work/polaris/main/e3sm_submodules/omega/sync-e3sm/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:280


xylar commented Jul 10, 2024

I'm going to move the conversation to Slack.


xylar commented Jul 10, 2024

Okay, this worked for me when I added --gpus=4 to my job script. How embarrassing!!!

@grnydawn, I think we can close this issue.
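
For the record, the working job script is just the one above with that single line added:

#!/bin/bash
#SBATCH  --job-name=omega_ctest_pm-gpu_nvidiagpu
#SBATCH  --account=e3sm
#SBATCH  --nodes=1
#SBATCH  --output=omega_ctest_pm-gpu_nvidiagpu.o%j
#SBATCH  --exclusive
#SBATCH  --time=0:15:00
#SBATCH  --qos=debug
#SBATCH  --constraint=gpu
#SBATCH  --gpus=4

cd /global/u2/x/xylar/e3sm_work/polaris/main/build_omega/build_pm-gpu_nvidiagpu
./omega_ctest.sh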

grnydawn commented:

@xylar, I am good to close this issue. Thanks for the work! BTW, we may continue to discuss the usage of OMEGA_ARCH somewhere else.

xylar closed this as completed Jul 10, 2024