Linker Error with Torch 2.4.0+cu124 (ImportError: libnccl.so.2: cannot open shared object file: No such file or directory) #2125
Comments
Your understanding of the issue is correct. There are 2 main issues that crop up from the lack of a standard site-packages layout:
We've recently been discussing these issues in our maintainers meeting. We have a few ideas to explore but nothing is in progress right now. For now, these issues can be worked around by:
Thanks @groodt. I've tried rules_py, but it doesn't seem to play nicely with the pybind11_bazel project. Can you elaborate on the solution of patching the torch wheel to preload the nvidia libraries? My current workaround is to run this symlinking script, which I verified fixes the issue, in case anyone else needs a quick unblock. Unfortunately, the paths are hard-coded for my one particular py_binary, the toolchain, and the pip repo.
Your team has probably thought about this, but maybe a similar script in Starlark that merges all dependencies into one giant virtual site-packages directory could work.
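For anyone following along, the merging idea can be sketched in plain Python rather than Starlark. This is only an illustration under stated assumptions, not the script referred to above and not anything rules_python ships: the function name merge_site_packages and the directory names in the usage note are hypothetical. It creates real directories in the merged tree and symlinks individual files, so packages that share a top-level directory (such as nvidia/) end up side by side.

```python
import os
from pathlib import Path


def merge_site_packages(package_roots, merged_root):
    """Symlink every file under each package root into one merged tree.

    Directories are created for real and only files are symlinked, so two
    roots that both contain e.g. ``nvidia/`` are merged instead of clashing.
    Returns the list of symlinks that were created.
    """
    merged_root = Path(merged_root)
    created = []
    for root in package_roots:
        # Resolve to an absolute path so the symlink targets stay valid
        # regardless of the current working directory.
        root = Path(root).resolve()
        for dirpath, _dirnames, filenames in os.walk(root):
            rel = Path(dirpath).relative_to(root)
            (merged_root / rel).mkdir(parents=True, exist_ok=True)
            for name in filenames:
                link = merged_root / rel / name
                if link.exists() or link.is_symlink():
                    continue  # first package wins if two roots ship the same file
                link.symlink_to(Path(dirpath) / name)
                created.append(link)
    return created
```

Given hypothetical roots such as pip_torch/site-packages and pip_nvidia_nccl/site-packages, calling merge_site_packages([...], "merged") would place torch/ and nvidia/nccl/lib/ under the same parent, which is the layout the relative lookups expect. The merged directory would then need to be placed ahead of the per-package directories on sys.path.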
Here's a public example of a patch that uses symlinks: pytorch/pytorch#117350 (comment)
Here's a public example of a patch that preloads the dynamically linked libraries: pytorch/pytorch#101314 (comment)
The patch I'm carrying at $dayjob is very similar to the second example, just slightly more complicated. Preloading means the dynamic linker won't need to look up the named libraries again in the process.
This is similar to one of our ideas. There are pros and cons and performance considerations to work out. I really do want to fix the site-packages issue. It comes up often and creates a lot of support requests. It feels like one of the last major missing pieces in the Python rules.
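To make the preloading approach concrete, here is a minimal sketch. It is not the linked patch and not the one mentioned above; the helper names find_bundled_libs and preload and the glob pattern nvidia/*/lib/*.so* are assumptions chosen to match the site-packages/nvidia/nccl/lib/libnccl.so.2 path from the issue description. The idea is to search every sys.path entry for bundled NVIDIA libraries and dlopen them before torch is imported.

```python
import ctypes
import sys
from pathlib import Path


def find_bundled_libs(search_roots, pattern="nvidia/*/lib/*.so*"):
    """Return shared libraries matching the pattern under each search root.

    The pattern is an assumption based on the wheel layout described in the
    issue (site-packages/nvidia/nccl/lib/libnccl.so.2).
    """
    found = []
    for root in search_roots:
        root = Path(root)
        if root.is_dir():
            found.extend(sorted(root.glob(pattern)))
    return found


def preload(libs):
    """dlopen each library with RTLD_GLOBAL and return those that loaded.

    Libraries that fail to load (for example because one of their own
    dependencies has not been loaded yet) are skipped rather than raising.
    """
    loaded = []
    for lib in libs:
        try:
            ctypes.CDLL(str(lib), mode=ctypes.RTLD_GLOBAL)
            loaded.append(lib)
        except OSError:
            continue
    return loaded
```

In a py_binary entry point this would be called as preload(find_bundled_libs(sys.path)) before the first import torch. A real patch would probably need to care about load order between the NVIDIA libraries, which this sketch ignores.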
Thanks so much, these links are very helpful. Meanwhile, I've also figured out how to get the pybind11_bazel pybind_extension working with Aspect's rules_py. You just have to include the pybind_extension target name in the py_binary/py_library's srcs, not deps.
Closing as duplicate of tracking issue #2156
🐞 bug report
Affected Rule
"@rules_python//python/extensions:pip.bzl" and any py_binary rules which depend on "@pip//torch:pkg" and cause import torch to be called.
Is this a regression?
Yes, the previous version in which this bug was not present was: ....
Description
Torch ships with its own NVIDIA libraries, which end up, for example, at site-packages/nvidia/nccl/lib/libnccl.so.2. Torch is usually installed into site-packages/torch, and there is a library there, site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so, which declares a dependency on libnccl through the rpath mechanism. See the output of readelf below.
However, within a Bazel project, the pip extension puts each package in its own directory, so the linker is not able to find the NVIDIA libraries using relative paths. What ends up happening is that the linker either finds the system libraries at /usr/local/... or the Python process raises an ImportError because the linker fails.
To resolve the issue, I think Bazel would have to symlink Torch's dependencies into Torch's site-packages directory.
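The relative-path failure can be illustrated with a small sketch of how an $ORIGIN-relative rpath entry expands. Everything specific here is an assumption for illustration: the rpath entry $ORIGIN/../../nvidia/nccl/lib, the library path torch/lib/libtorch_cuda.so, and the repository directory names are hypothetical and are not taken from the readelf output. The function only does lexical expansion and does not consult the file system.

```python
import posixpath


def expand_rpath(library_path, rpath_entry):
    """Expand $ORIGIN in an rpath entry relative to the given library.

    $ORIGIN stands for the directory containing the library. The result is
    normalized lexically, so ".." segments are collapsed without looking at
    symlinks on disk.
    """
    origin = posixpath.dirname(library_path)
    expanded = rpath_entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
    return posixpath.normpath(expanded)


# Hypothetical rpath entry pointing from torch/lib up to a sibling nvidia/ tree.
RPATH = "$ORIGIN/../../nvidia/nccl/lib"

# Standard layout: torch and nvidia share one site-packages directory.
standard = expand_rpath("/venv/site-packages/torch/lib/libtorch_cuda.so", RPATH)

# Per-package layout: each wheel has its own site-packages (names hypothetical).
split = expand_rpath("/external/pip_torch/site-packages/torch/lib/libtorch_cuda.so", RPATH)
```

In the standard layout the entry expands to /venv/site-packages/nvidia/nccl/lib, where libnccl.so.2 actually lives. In the per-package layout it expands to a directory under pip_torch, while libnccl.so.2 would sit under a different repository directory, so the lookup misses.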
🔬 Minimal Reproduction
In MODULE.bazel
Then in the requirements file
Then try to import torch from any py_binary with dependency on "@pip//torch:pkg".
🔥 Exception or Error
🌍 Your Environment
Operating System:
Output of bazel version:
Rules_python version:
Anything else relevant?