Specify additional steps to utilize GPU for Linux users #2299

Open
wants to merge 13 commits into master

Conversation

sgkouzias

Specify additional steps to utilize GPU for Linux users

Specify additional steps to utilize GPU for Linux users
Advice to skip additional step 6 if using CPU.
@8bitmp3
Contributor

8bitmp3 commented Apr 9, 2024

@MarkDaoust @markmcd

Added a second option to create a virtual env via Python's built-in venv module for Linux users with CUDA-enabled GPUs.
Added virtual env activation/deactivation commands and changed the wording for editing the deactivate block in the activate script of the venv virtual environment.
Added instructions to resolve the ptxas issue.
Revised the CUDNN_DIR definition.
Corrected the LD_LIBRARY_PATH definition in the conda environment instructions.
Renamed the environment variable to PTXAS_DIR and revised the package manager options.
Added a note to use pip instead of conda to install TensorFlow.
Author

@sgkouzias sgkouzias left a comment


Added steps and respective instructions to install TensorFlow by running the pip install tensorflow[and-cuda] command within a virtual environment (option 1: conda, option 2: venv) and to set environment variables that locate the compatible NVIDIA libraries installed with TensorFlow, so GPUs can be utilized effectively. The solution has been successfully tested.

Reference: tensorflow/tensorflow#63362
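
For context, here is a minimal sketch of the kind of environment-variable setup being described, assuming the NVIDIA wheels were pulled in by pip install tensorflow[and-cuda]; the exact commands live in the PR's pip.md changes, and nvidia.cudnn is used here only as a convenient anchor for locating the parent nvidia/ directory:

```bash
# Sketch: prepend every pip-installed NVIDIA lib directory to LD_LIBRARY_PATH.
NVIDIA_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))
for dir in "$NVIDIA_DIR"/*; do
  if [ -d "$dir/lib" ]; then
    export LD_LIBRARY_PATH="$dir/lib:$LD_LIBRARY_PATH"
  fi
done
```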

@sgkouzias
Author

sgkouzias commented May 10, 2024

@haifeng-jin, @MarkDaoust, @8bitmp3 I await any suggestions or revisions if needed. Do we have any updates?

@sgkouzias sgkouzias marked this pull request as draft May 16, 2024 13:23
@sgkouzias sgkouzias marked this pull request as ready for review May 16, 2024 13:28
@haifeng-jin
Collaborator

As I remember, the currently recommended way to install TF is to use pip. I do not have further info on this. @MarkDaoust may comment on this.

@sgkouzias
Author

sgkouzias commented May 20, 2024

As I remember, the currently recommended way to install TF is to use pip. I do not have further info on this. @MarkDaoust may comment on this.

@haifeng-jin it is practically impossible for someone who owns a PC with a CUDA-enabled GPU to run deep learning experiments with TensorFlow 2.16.1 and utilize the GPU locally without manually performing, at least as a temporary fix, some extra steps that are not included (as of today) in the official TensorFlow installation documentation for Linux users with GPUs!

It turns out that when you pip install tensorflow[and-cuda], all required NVIDIA libraries are installed as well. You just need to configure the environment variables manually so that TensorFlow can locate them and run on the GPU.
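
As a quick sanity check (a sketch, not taken from the PR text), the NVIDIA wheels that come with the [and-cuda] extra can be listed after installation:

```bash
# Inside the virtual environment: install TensorFlow with the CUDA extra and
# list the NVIDIA wheels it pulled in (package names vary by release).
pip install "tensorflow[and-cuda]"
pip list | grep -i nvidia
```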

Contributor

@mihaimaruseac mihaimaruseac left a comment


Please don't use "add file"/"update file"/"fix file"/etc. commit messages. These are hard to reason about when looking at the history of the file/repository. Instead, please write explanatory git commit messages.

The commit message is also the title of the PR if the PR has only one commit. It is thus doubly important to have relevant commit messages, as PRs become easier to understand and easier to analyze in search results.

For how to write good quality git commit messages, please consult https://cbea.ms/git-commit/

@mihaimaruseac
Contributor

It turns out that when you pip install tensorflow[and-cuda], all required NVIDIA libraries are installed as well. You just need to configure the environment variables manually so that TensorFlow can locate them and run on the GPU.

Can we instead add these to the install guide?

@sgkouzias sgkouzias changed the title Update pip.md Specify additional steps to utilize GPU for Linux users May 24, 2024
@sgkouzias
Author

configure the environment variables manually

@mihaimaruseac shouldn't we explain/specify how to configure the environment variables manually, as appropriate?

Contributor

@mihaimaruseac mihaimaruseac left a comment


I read the update and it seems reasonable to me. Thank you

@Tachi107

Why is conda mentioned in this patch? It makes the install guide more convoluted and seems unnecessary to me.

@sgkouzias
Author

Why is conda mentioned in this patch? It makes the install guide more convoluted and seems unnecessary to me.

@Tachi107 I agree. Should I proceed to erase everything related to conda (referred to as option 1) and just keep one suggested option (create a venv virtual environment)? Perhaps that would be better and more straightforward?

@Tachi107

Note that I'm not a tensorflow maintainer, just a casual user who happened to stumble upon this patch. But yeah, if I were you I would just show how to set up the venv. Conda users should already know how to do that with their non-default setup :)

@sgkouzias
Author

Note that I'm not a tensorflow maintainer, just a casual user who happened to stumble upon this patch. But yeah, if I were you I would just show how to set up the venv. Conda users should already know how to do that with their non-default setup :)

@Tachi107 thank you. It seems very reasonable to simplify the guide like that. However, for now I will keep it as is and await the maintainers' comments as well.

@sgkouzias
Author

@haifeng-jin, @MarkDaoust, @8bitmp3 I await any suggestions or revisions if needed. Do we have any updates?

@t-kalinowski

t-kalinowski commented Jun 17, 2024

There is no need to use conda; a standard venv works fine. In 2.15, tensorflow knew to go look for the NVIDIA binaries installed with pip. With TF 2.16, you can help it by placing the binaries on LD_LIBRARY_PATH, as suggested in this PR, or by creating symlinks from the TF package to the pip-installed nvidia packages. E.g.,

```bash
python -m venv my-venv
source my-venv/bin/activate
python -m pip install 'tensorflow[and-cuda]'
pushd $(dirname $(python -c 'print(__import__("tensorflow").__file__)'))
ln -svf ../nvidia/*/lib/*.so* .
popd
```

This produces output like:

```
'./libcublasLt.so.12' -> '../nvidia/cublas/lib/libcublasLt.so.12'
'./libcublas.so.12' -> '../nvidia/cublas/lib/libcublas.so.12'
'./libnvblas.so.12' -> '../nvidia/cublas/lib/libnvblas.so.12'
'./libcheckpoint.so' -> '../nvidia/cuda_cupti/lib/libcheckpoint.so'
'./libcupti.so.12' -> '../nvidia/cuda_cupti/lib/libcupti.so.12'
'./libnvperf_host.so' -> '../nvidia/cuda_cupti/lib/libnvperf_host.so'
'./libnvperf_target.so' -> '../nvidia/cuda_cupti/lib/libnvperf_target.so'
'./libpcsamplingutil.so' -> '../nvidia/cuda_cupti/lib/libpcsamplingutil.so'
'./libnvrtc-builtins.so.12.3' -> '../nvidia/cuda_nvrtc/lib/libnvrtc-builtins.so.12.3'
'./libnvrtc.so.12' -> '../nvidia/cuda_nvrtc/lib/libnvrtc.so.12'
'./libcudart.so.12' -> '../nvidia/cuda_runtime/lib/libcudart.so.12'
'./libcudnn_adv_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_adv_infer.so.8'
'./libcudnn_adv_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_adv_train.so.8'
'./libcudnn_cnn_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8'
'./libcudnn_cnn_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_cnn_train.so.8'
'./libcudnn_ops_infer.so.8' -> '../nvidia/cudnn/lib/libcudnn_ops_infer.so.8'
'./libcudnn_ops_train.so.8' -> '../nvidia/cudnn/lib/libcudnn_ops_train.so.8'
'./libcudnn.so.8' -> '../nvidia/cudnn/lib/libcudnn.so.8'
'./libcufft.so.11' -> '../nvidia/cufft/lib/libcufft.so.11'
'./libcufftw.so.11' -> '../nvidia/cufft/lib/libcufftw.so.11'
'./libcurand.so.10' -> '../nvidia/curand/lib/libcurand.so.10'
'./libcusolverMg.so.11' -> '../nvidia/cusolver/lib/libcusolverMg.so.11'
'./libcusolver.so.11' -> '../nvidia/cusolver/lib/libcusolver.so.11'
'./libcusparse.so.12' -> '../nvidia/cusparse/lib/libcusparse.so.12'
'./libnccl.so.2' -> '../nvidia/nccl/lib/libnccl.so.2'
'./libnvJitLink.so.12' -> '../nvidia/nvjitlink/lib/libnvJitLink.so.12'
```

This is essentially what we do from the R interface in tensorflow::install_tensorflow() and keras3::install_keras()

Removed option to install within conda virtual environment. Recommendation to install in venv environment.
@sgkouzias
Author

@t-kalinowski thank you very much for your valuable advice. I revised the PR accordingly.

@t-kalinowski

@sgkouzias if you also create a symlink at my-venv/bin/ptxas -> my-venv/lib/python.../site-packages/.../bin/ptxas, then you could probably get away without needing to require users to modify the default activate and deactivate scripts.
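
A sketch of that suggestion, assuming the ptxas binary ships in the nvidia-cuda-nvcc wheel under site-packages (the exact python... path segment depends on the interpreter version):

```bash
# Link the wheel-provided ptxas into the venv's bin/ so it is found on PATH
# ahead of any system CUDA installation (illustrative paths).
ln -sf "$(find "$VIRTUAL_ENV"/lib/python*/site-packages/nvidia/cuda_nvcc -name ptxas -type f -print -quit)" \
  "$VIRTUAL_ENV/bin/ptxas"
```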

Replaced instructions to modify default activate/deactivate scripts with instructions to create symlinks to NVIDIA shared libraries and ptxas.
@sgkouzias
Author

@sgkouzias if you also create a symlink at my-venv/bin/ptxas -> my-venv/lib/python.../site-packages/.../bin/ptxas, then you could probably get away without needing to require users to modify the default activate and deactivate scripts.

@t-kalinowski thank you so much for your advice. The instructions have been fully revised as per your comments. Users are no longer required to modify the default activate and deactivate scripts. The instructions should now resemble, more or less, what you do in the R interface.

@sgkouzias
Author

sgkouzias commented Jun 19, 2024

@8bitmp3, @haifeng-jin, @MarkDaoust even TensorFlow version 2.17.0.rc0 requires additional steps for Linux users to utilize the GPU. The instructions suggested in this pull request offer a tested solution. I await your comments.


```bash
source tf/bin/activate
deactivate
```


Can you remove deactivate?

Author


Can you remove deactivate?

@learning-to-play, removed deactivate as advised. Furthermore, I could remove the instruction to create a symlink to ptxas, since it is ultimately not needed for TensorFlow version 2.17.0.rc0 but only for TensorFlow version 2.16.1. Awaiting your comments.


I want to make sure that I understand the situation correctly. Which of the following two situations is correct?

Author

@sgkouzias sgkouzias Jun 19, 2024


@learning-to-play the only difference is that on version 2.17.0.rc0 you only need to create the symlinks to the NVIDIA libs in order to utilize the GPU, while on version 2.16.1 you must, in addition to the NVIDIA lib symlinks, also create a symlink to ptxas. Consequently, the command pip install tensorflow[and-cuda] alone fails to work with GPUs on both versions.

@sgkouzias
Author

sgkouzias commented Jul 1, 2024

@learning-to-play, @SeeForTwo, @8bitmp3, @haifeng-jin, @MarkDaoust, @markmcd

Unfortunately, the latest release, namely TensorFlow 2.16.2, does not fix the ptxas bug. When running a training script I get the error:

```
ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas 12.3.103 has a bug that we think can affect XLA. Please use a different version.' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Aborted (core dumped)
```

So it seems that TensorFlow 2.16.2 fails to work with GPUs as well!

Notes:

1. Successful installation was verified by running:

   ```bash
   python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
   ```

2. The solution included in the submitted pull request (pending review) helped get rid of the ptxas bug and ultimately got TensorFlow 2.16.2 to work with my GPU (a quick verification sketch follows below):

   ```bash
   ln -sf $(find $(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)"))/*/bin/) -name ptxas -print -quit) $VIRTUAL_ENV/bin/ptxas
   ```
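
A quick way to confirm the symlink takes effect inside the activated venv (a sketch, not part of the PR):

```bash
# The venv's bin/ should now shadow any system ptxas.
which ptxas        # expected: $VIRTUAL_ENV/bin/ptxas
ptxas --version    # prints the wheel-provided ptxas version
```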

@belitskiy
Member

Thank you for the contribution, @sgkouzias :)
Given that the [and-cuda] installation now does detect pip-installed CUDA components again, please add a disclaimer specifying that symbolic links are only necessary in case the intended way doesn't work, i.e. the components aren't being detected and/or conflict with an existing system CUDA installation (like ptxas did for you).

Revised the step with instructions to configure the virtual environment variables for GPU users by adding a disclaimer.
@sgkouzias
Author

Thank you for the contribution, @sgkouzias :) Given that the [and-cuda] installation now does detect pip-installed CUDA components again, please add a disclaimer specifying that symbolic links are only necessary in case the intended way doesn't work, i.e. the components aren't being detected and/or conflict with an existing system CUDA installation (like ptxas did for you).

@belitskiy, @learning-to-play I revised the instructions as advised and will be awaiting your feedback. It is my honor to contribute to the TensorFlow community.

10 participants