Specify additional steps to utilize GPU for Linux users #2299

Open
wants to merge 13 commits into master

Conversation

sgkouzias

Specify additional steps to utilize GPU for Linux users

Specify additional steps to utilize GPU for Linux users
Advice to skip additional step 6 if using CPU.
@8bitmp3
Contributor

8bitmp3 commented Apr 9, 2024

@MarkDaoust @markmcd

Added a second option to create a virtual env via Python's built-in venv module for Linux users with CUDA-enabled GPUs.
Added virtual env activation/deactivation commands and changed the wording for editing the deactivate block in the activate script of the venv virtual environment.
Added instructions to resolve the ptxas issue.
Revised the CUDNN_DIR definition.
Corrected the LD_LIBRARY_PATH definition in the conda environment instructions.
Renamed the environment variable to PTXAS_DIR and revised the package manager options.
Added a note to use pip instead of conda to install TensorFlow.
Author

@sgkouzias sgkouzias left a comment


Added steps and respective instructions to install TensorFlow by running the pip install tensorflow[and-cuda] command within a virtual environment (option 1: conda, option 2: venv) and to set environment variables that locate the compatible NVIDIA libraries installed with TensorFlow, so GPUs can be utilized effectively. The solution has been successfully tested.

Reference: tensorflow/tensorflow#63362
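
For context, here is a minimal sketch of the kind of environment-variable setup being described, assuming the NVIDIA wheels were pulled in by pip install tensorflow[and-cuda]; the exact commands live in the PR's pip.md changes, and nvidia.cudnn is used here only as a convenient anchor for locating the parent nvidia/ directory:

```bash
# Sketch: prepend every pip-installed NVIDIA lib directory to LD_LIBRARY_PATH.
NVIDIA_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))
for dir in "$NVIDIA_DIR"/*; do
  if [ -d "$dir/lib" ]; then
    export LD_LIBRARY_PATH="$dir/lib:$LD_LIBRARY_PATH"
  fi
done
```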

@sgkouzias
Author

sgkouzias commented May 10, 2024

@haifeng-jin, @MarkDaoust, @8bitmp3 I await any suggestions or revisions if needed. Do we have any updates?

@sgkouzias sgkouzias marked this pull request as draft May 16, 2024 13:23
@sgkouzias sgkouzias marked this pull request as ready for review May 16, 2024 13:28
@haifeng-jin
Collaborator

As I remember, the currently recommended way to install TF is to use pip. I do not have further info on this. @MarkDaoust may comment on this.

@sgkouzias
Author

sgkouzias commented May 20, 2024

As I remember, the currently recommended way to install TF is to use pip. I do not have further info on this. @MarkDaoust may comment on this.

@haifeng-jin it is practically impossible for someone who owns a PC with a CUDA-enabled GPU to run deep learning experiments with TensorFlow 2.16.1 and utilize the GPU locally without manually performing, at least as a temporary fix, some extra steps that are not included (as of today) in the official TensorFlow installation documentation for Linux users with GPUs!

It turns out that when you pip install tensorflow[and-cuda], all required NVIDIA libraries are installed as well. You just need to configure the environment variables manually so that TensorFlow can locate them and run on the GPU.
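
As a quick sanity check (a sketch, not taken from the PR text), the NVIDIA wheels that come with the [and-cuda] extra can be listed after installation:

```bash
# Inside the virtual environment: install TensorFlow with the CUDA extra and
# list the NVIDIA wheels it pulled in (package names vary by release).
pip install "tensorflow[and-cuda]"
pip list | grep -i nvidia
```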

Contributor

@mihaimaruseac mihaimaruseac left a comment


Please don't use "add file"/"update file"/"fix file"/etc. commit messages. These are hard to reason about when looking at the history of the file/repository. Instead, please write explanatory git commit messages.

The commit message is also the title of the PR if the PR has only one commit. It is thus doubly important to have relevant commit messages, as PRs become easier to understand and easier to analyze in search results.

For how to write good quality git commit messages, please consult https://cbea.ms/git-commit/

@mihaimaruseac
Contributor

It turns out that when you pip install tensorflow[and-cuda], all required NVIDIA libraries are installed as well. You just need to configure the environment variables manually so that TensorFlow can locate them and run on the GPU.

Can we instead add these to the install guide?

@sgkouzias sgkouzias changed the title Update pip.md Specify additional steps to utilize GPU for Linux users May 24, 2024
@sgkouzias
Author

configure the environment variables manually

@mihaimaruseac shouldn't we explain/specify how to configure the environment variables manually, as appropriate?

Contributor

@mihaimaruseac mihaimaruseac left a comment


I read the update and it seems reasonable to me. Thank you

@Tachi107

Why is conda mentioned in this patch? It makes the install guide more convoluted and seems unnecessary to me.

@sgkouzias
Author

Why is conda mentioned in this patch? It makes the install guide more convoluted and seems unnecessary to me.

@Tachi107 I agree. Should I proceed to erase everything related to conda (referred to as option 1) and just keep one suggested option (create a venv virtual environment)? Perhaps that would be better and more straightforward?

@Tachi107

Note that I'm not a tensorflow maintainer, just a casual user who happened to stumble upon this patch. But yeah, if I were you I would just show how to set up the venv. Conda users should already know how to do that with their non-default setup :)

@sgkouzias
Author

Note that I'm not a tensorflow maintainer, just a casual user who happened to stumble upon this patch. But yeah, if I were you I would just show how to set up the venv. Conda users should already know how to do that with their non-default setup :)

@Tachi107 thank you. It seems very reasonable to simplify the guide like that. However, for now I will keep it as is and await the maintainers' comments as well.

@sgkouzias
Author

@haifeng-jin, @MarkDaoust, @8bitmp3 I await any suggestions or revisions if needed. Do we have any updates?

@t-kalinowski

t-kalinowski commented Jun 17, 2024

There is no need to use conda; a standard venv works fine. In 2.15, tensorflow knew to go look for the NVIDIA binaries installed with pip. With TF 2.16, you can help it by placing the binaries on LD_LIBRARY_PATH, as suggested in this PR, or by creating symlinks from the TF package to the pip-installed nvidia packages. E.g.,

```bash
python -m venv my-venv
source my-venv/bin/activate
python -m pip install 'tensorflow[and-cuda]'
pushd $(dirname $(python -c 'print(__import__("tensorflow").__file__)'))
ln -svf ../nvidia/*/lib/*.so* .
popd
```

This produces output like:

```
'./libcublasLt.so.12' -> '../nvidia/cublas/lib/libcublasLt.so.12'
'./libcublas.so.12' -> '../nvidia/cublas/lib/libcublas.so.12'
'./libnvblas.so.12' -> '../nvidia/cublas/lib/libnvblas.so.12'
'./libcheckpoint.so' -> '../nvidia/cuda_cupti/lib/libcheckpoint.so'
'./libcupti.so.12' -> '../nvidia/cuda_cupti/lib/libcupti.so.12'
'./libnvperf_host.so' -> '../nvidia/cuda_cupti/lib/libnvperf_host.so'
'./libnvperf_target.so' -> '../nvidia/cuda_cupti/lib/libnvperf_target.so'
'./libpcsamplingutil.so' -> '../nvidia/cuda_cupti/lib/libpcsamplingutil.so'
'./libnvrtc-builtins.so.12.3' -> '../nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.3'
'./libnvrtc.so.12' -> '../nvidia/cuda_nvrtc/lib/libnvrtc.so.12'
'./libcudart.so.12' -> '../nvidia/cuda_runtime/lib/libcudart.so.12'
'./libcudnn_adv_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_adv_infer.so.8'
'./libcudnn_adv_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_adv_train.so.8'
'./libcudnn_cnn_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8'
'./libcudnn_cnn_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_cnn_train.so.8'
'./libcudnn_ops_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_ops_infer.so.8'
'./libcudnn_ops_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_ops_train.so.8'
'./libcudnn.so.8' -> '../nvidia/cudnn/lib/libcudnn.so.8'
'./libcufft.so.11' -> '../nvidia/cufft/lib/libcufft.so.11'
'./libcufftw.so.11' -> '../nvidia/cufft/lib/libcufftw.so.11'
'./libcurand.so.10' -> '../nvidia/curand/lib/libcurand.so.10'
'./libcusolverMg.so.11' -> '../nvidia/cusolver/lib/libcusolverMg.so.11'
'./libcusolver.so.11' -> '../nvidia/cusolver/lib/libcusolver.so.11'
'./libcusparse.so.12' -> '../nvidia/cusparse/lib/libcusparse.so.12'
'./libnccl.so.2' -> '../nvidia/nccl/lib/libnccl.so.2'
'./libnvJitLink.so.12' -> '../nvidia/nvjitlink/lib/libnvJitLink.so.12'
```

This is essentially what we do from the R interface in tensorflow::install_tensorflow() and keras3::install_keras()

Removed option to install within conda virtual environment. Recommendation to install in venv environment.
@sgkouzias
Author

@t-kalinowski thank you very much for your valuable advice. I revised the PR accordingly.

@t-kalinowski

@sgkouzias if you also create a symlink at my-venv/bin/ptxas -> my-venv/lib/python.../site-packages/.../bin/ptxas, then you could probably get away without needing to require users to modify the default activate and deactivate scripts.
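
A sketch of that suggestion, assuming the ptxas binary ships in the nvidia-cuda-nvcc wheel under site-packages (the exact python... path segment depends on the interpreter version):

```bash
# Link the wheel-provided ptxas into the venv's bin/ so it is found on PATH
# ahead of any system CUDA installation (illustrative paths).
ln -sf "$(find "$VIRTUAL_ENV"/lib/python*/site-packages/nvidia/cuda_nvcc -name ptxas -type f -print -quit)" \
  "$VIRTUAL_ENV/bin/ptxas"
```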

Replaced instructions to modify default activate/deactivate scripts with instructions to create symlinks to NVIDIA shared libraries and ptxas.
@sgkouzias
Author

@sgkouzias if you also create a symlink at my-venv/bin/ptxas -> my-venv/lib/python.../site-packages/.../bin/ptxas, then you could probably get away without needing to require users to modify the default activate and deactivate scripts.

@t-kalinowski thank you so much for your advice. The instructions have been fully revised as per your comments. Users are no longer required to modify the default activate and deactivate scripts. The instructions should now resemble, more or less, what you do in the R interface.

@sgkouzias
Author

sgkouzias commented Jun 19, 2024

@8bitmp3, @haifeng-jin, @MarkDaoust even TensorFlow version 2.17.0.rc0 requires additional steps for Linux users to utilize the GPU. The instructions suggested in this pull request offer a tested solution. I await your comments.


```bash
source tf/bin/activate
deactivate
```


Can you remove deactivate?

Author


Can you remove deactivate?

@learning-to-play, removed deactivate as advised. Furthermore, I could remove the instruction to create a symlink to ptxas, since it is ultimately not needed for TensorFlow version 2.17.0.rc0 but only for TensorFlow version 2.16.1. Awaiting your comments.


I want to make sure that I understand the situation correctly. Which of the following two situations is correct?

Author

@sgkouzias sgkouzias Jun 19, 2024


@learning-to-play the only difference is that on version 2.17.0.rc0 you only need to create the symlinks to the NVIDIA libs in order to utilize the GPU, while on version 2.16.1 you must, in addition to the NVIDIA lib symlinks, also create a symlink to ptxas. Consequently, the command pip install tensorflow[and-cuda] alone fails to work with GPUs on both versions.

@sgkouzias
Author

sgkouzias commented Jul 1, 2024

@learning-to-play, @SeeForTwo, @8bitmp3, @haifeng-jin, @MarkDaoust, @markmcd

Unfortunately, the latest release, namely TensorFlow 2.16.2, does not fix the ptxas bug. When running a training script I get the error:

```
ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas 12.3.103 has a bug that we think can affect XLA. Please use a different version.' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Aborted (core dumped)
```

So it seems that TensorFlow 2.16.2 fails to work with GPUs as well!

Notes:

1. Successful installation was verified by running:

   ```bash
   python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
   ```

2. The solution included in the submitted pull request (pending review) helped get rid of the ptxas bug and ultimately got TensorFlow 2.16.2 to work with my GPU (a quick verification sketch follows below):

   ```bash
   ln -sf $(find $(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)"))/*/bin/) -name ptxas -print -quit) $VIRTUAL_ENV/bin/ptxas
   ```
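
A quick way to confirm the symlink takes effect inside the activated venv (a sketch, not part of the PR):

```bash
# The venv's bin/ should now shadow any system ptxas.
which ptxas        # expected: $VIRTUAL_ENV/bin/ptxas
ptxas --version    # prints the wheel-provided ptxas version
```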

@belitskiy
Member

Thank you for the contribution, @sgkouzias :)
Given that the [and-cuda] installation now does detect pip-installed CUDA components again, please add a disclaimer specifying that symbolic links are only necessary in case the intended way doesn't work, i.e. the components aren't being detected and/or conflict with an existing system CUDA installation (like ptxas did for you).

Revised the step with instructions to configure the virtual environment variables for GPU users by adding a disclaimer.
@sgkouzias
Author

Thank you for the contribution, @sgkouzias :) Given that the [and-cuda] installation now does detect pip-installed CUDA components again, please add a disclaimer specifying that symbolic links are only necessary in case the intended way doesn't work, i.e. the components aren't being detected and/or conflict with an existing system CUDA installation (like ptxas did for you).

@belitskiy, @learning-to-play I revised the instructions as advised and will be awaiting your feedback. It is my honor to contribute to the TensorFlow community.

10 participants