
Compiling on Slurmcluster fatal error: cudnn.h: No such file or directory #918

Open · windprak opened this issue Jun 12, 2024 · 3 comments
Labels: bug (Something isn't working), build (Build system)

@windprak

I'm trying to compile TE on a Slurm cluster because containers aren't fully supported there (MPI issues).
My setup looks like this:


module load cuda/12.4.1
module load cmake/3.23.1 
module load git/2.35.2 
module load gcc/12.1.0
module load cudnn/9.1.0.70-12.x

source $WORK/venvs/megatron/bin/activate
python -m pip install --force-reinstall setuptools==69.5.1
python -m pip install nltk sentencepiece einops mpmath packaging numpy ninja wheel
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
pip install wheel
MAX_JOBS=4 pip install flash-attn==2.4.2 --no-build-isolation
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable

export CXXFLAGS="-isystem $CUDNN_ROOT/include"
pip install git+https://github.com/NVIDIA/TransformerEngine.git@main  #or stable doesn't matter
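A quick sanity check before the install (assuming the cudnn module exports $CUDNN_ROOT, as the CXXFLAGS above rely on) that the header is actually where the compiler will look:

```shell
# Verify the header the TE build needs is visible under the module's prefix.
# $CUDNN_ROOT is assumed to be set by `module load cudnn/...`; adjust if your
# site's module uses a different variable name.
CUDNN_ROOT="${CUDNN_ROOT:-/usr}"
if [ -f "${CUDNN_ROOT}/include/cudnn.h" ]; then
    echo "cudnn.h found: ${CUDNN_ROOT}/include/cudnn.h"
else
    echo "cudnn.h missing under ${CUDNN_ROOT}/include" >&2
fi
```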

All the variables echo correctly. I can build Megatron-LM and Apex in this environment without problems, but not TE.

Error:

conda/envs/megatron/lib/python3.10/site-packages/torch/include/ATen/cudnn/cudnn-wrapper.h:3:10: fatal error: cudnn.h: No such file or directory
          3 | #include <cudnn.h>
            |          ^~~~~~~~~
@timmoon10 timmoon10 added bug Something isn't working build Build system labels Jun 13, 2024
@timmoon10
Collaborator

It looks like PyTorch's C++ extensions are configured with CUDNN_HOME or CUDNN_PATH:
https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/torch/utils/cpp_extension.py#L209
while PyTorch's own CMake build is configured with CUDNN_ROOT:
https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/cmake/Modules_CUDA_fix/FindCUDNN.cmake#L4
So setting CUDNN_ROOT alone, as the cudnn module does, has no effect on the extension build.
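A minimal sketch of that lookup order (variable names taken from the links above; this is an illustration, not PyTorch's exact code):

```shell
# cpp_extension consults CUDNN_HOME first, then CUDNN_PATH; CUDNN_ROOT from
# the module system is never read, which is why <cudnn.h> isn't found.
cudnn_home="${CUDNN_HOME:-${CUDNN_PATH:-}}"
if [ -n "${cudnn_home}" ]; then
    echo "torch cpp_extension would use: ${cudnn_home}"
else
    echo "neither CUDNN_HOME nor CUDNN_PATH is set; the cudnn.h include will fail"
fi
```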

@ywb2018

ywb2018 commented Jun 22, 2024


So what can I do to handle this issue? Please give a clear and simple answer, thanks!

@timmoon10
Collaborator

timmoon10 commented Jun 25, 2024

export CUDNN_PATH=/path/to/cudnn
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
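To confirm the variable points at a usable prefix before rebuilding (the path below is the same placeholder as above; substitute the prefix your cudnn module exposes):

```shell
# CUDNN_PATH must be the install prefix, i.e. the directory containing
# include/cudnn.h and lib/ (placeholder path; substitute your own).
export CUDNN_PATH=/path/to/cudnn
if [ -f "${CUDNN_PATH}/include/cudnn.h" ]; then
    echo "OK: ${CUDNN_PATH}/include/cudnn.h"
else
    echo "check CUDNN_PATH: no include/cudnn.h under ${CUDNN_PATH}" >&2
fi
```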
