
Linker Error with Torch 2.4.0+cu124 (ImportError: libnccl.so.2: cannot open shared object file: No such file or directory) #2125

Closed
axbycc-mark opened this issue Aug 17, 2024 · 5 comments

@axbycc-mark

axbycc-mark commented Aug 17, 2024

🐞 bug report

Affected Rule

"@rules_python//python/extensions:pip.bzl", and any py_binary rule that depends on "@pip//torch:pkg" and causes import torch to run.

Is this a regression?

Yes, the previous version in which this bug was not present was: ....

Description

Torch ships with its own copies of the NVIDIA CUDA libraries, which end up, for example, at site-packages/nvidia/nccl/lib/libnccl.so.2. Torch itself is usually installed into site-packages/torch, and the shared library site-packages/torch/_C.cpython-311-x86_64-linux-gnu.so there declares a dependency on libnccl through the RPATH mechanism. See the readelf output below.

readelf -d _C.cpython-311-x86_64-linux-gnu.so | grep 'rpath\|runpath'
 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../../nvidia/cublas/lib:$ORIGIN/../../nvidia/cuda_cupti/lib:$ORIGIN/../../nvidia/cuda_nvrtc/lib:$ORIGIN/../../nvidia/cuda_runtime/lib:$ORIGIN/../../nvidia/cudnn/lib:$ORIGIN/../../nvidia/cufft/lib:$ORIGIN/../../nvidia/curand/lib:$ORIGIN/../../nvidia/cusolver/lib:$ORIGIN/../../nvidia/cusparse/lib:$ORIGIN/../../nvidia/nccl/lib:$ORIGIN/../../nvidia/nvtx/lib:$ORIGIN:$ORIGIN/lib]

Within a Bazel project, however, the pip extension puts each package in its own repository directory, so the dynamic linker cannot find the NVIDIA libraries via those $ORIGIN-relative paths. The linker then either falls back to system libraries under /usr/local/... or fails outright, and the Python process raises an ImportError.

To resolve the issue, I think Bazel would have to symlink Torch's dependencies into Torch's site-packages directory.
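To make the failure concrete, here is a small sketch (illustrative paths and logic, not the real loader) of how a single $ORIGIN-relative RPATH entry resolves: in a standard site-packages the relative hop lands on site-packages/nvidia/..., but in a Bazel per-package repository there is no sibling nvidia directory, so the lookup misses.

```python
import os
from pathlib import Path
from typing import Optional

def resolve_rpath_entry(so_path: str, rpath_entry: str, lib_name: str) -> Optional[str]:
    """Expand $ORIGIN in one RPATH entry, roughly as the dynamic linker would,
    and return the full library path if the library exists there, else None."""
    origin = os.path.dirname(os.path.abspath(so_path))
    candidate = Path(rpath_entry.replace("$ORIGIN", origin)).resolve() / lib_name
    return str(candidate) if candidate.exists() else None
```

Running this against a standard layout finds libnccl.so.2 via $ORIGIN/../../nvidia/nccl/lib, while the same entry evaluated from inside an isolated Bazel repo directory returns None.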

🔬 Minimal Reproduction

In MODULE.bazel

pip.parse(
    hub_name = "pip",
    python_version = python_version,
    requirements_lock = "//:requirements.txt",
    requirements_windows = "//:requirements_windows.txt",
)
use_repo(pip, "pip")

Then in the requirements file

torch==2.4.0+cu124

Then try to import torch from any py_binary with dependency on "@pip//torch:pkg".

🔥 Exception or Error

.../rules_python~~pip~pip_311_torch/site-packages/torch/__init__.py", line 294, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory

🌍 Your Environment

Operating System: Ubuntu 22.04.4 LTS

Output of bazel version: 7.3.0

rules_python version: 0.31.0
Anything else relevant?

@groodt
Collaborator

groodt commented Aug 17, 2024

Your understanding of the issue is correct.

There are two main issues caused by the lack of a standard site-packages layout:

  • packages like the NVIDIA libraries, which assume a site-packages layout where sibling packages can be found by traversing the filesystem
  • namespace packages, which also fail to import

We've recently been discussing these issues in our maintainers meeting. We have a few ideas to explore but nothing is in progress right now.

For now, these issues can be worked around by:

  • patching the torch wheel to preload the NVIDIA libraries
  • using alternative rules that do use a site-packages layout, such as rules_py
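A hedged sketch of the preload workaround (the directory pattern and package set are assumptions; see the pytorch patches linked below for production versions): load every NVIDIA shared library with RTLD_GLOBAL before import torch, so the dynamic linker finds libnccl.so.2 and friends already mapped into the process and never consults the broken RPATH.

```python
import ctypes
import glob
import os

def preload_nvidia_libs(site_packages_roots):
    """Load every NVIDIA shared library found under the given roots with
    RTLD_GLOBAL so later lookups (e.g. by torch's _C extension) resolve
    in-process. Returns the list of paths that loaded successfully."""
    loaded = []
    for root in site_packages_roots:
        for lib in sorted(glob.glob(os.path.join(root, "nvidia", "*", "lib", "*.so*"))):
            try:
                ctypes.CDLL(lib, mode=ctypes.RTLD_GLOBAL)
                loaded.append(lib)
            except OSError:
                pass  # skip libraries whose own dependencies are not loadable yet
    return loaded

# Usage, before `import torch` (paths are hypothetical Bazel repo roots):
# preload_nvidia_libs(["/path/to/pip_311_nvidia_nccl_cu12/site-packages"])
```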

@axbycc-mark
Author

Thanks @groodt. I've tried rules_py, but it doesn't seem to play nicely with the pybind11_bazel project. Can you elaborate on the solution of patching the torch wheel to preload the nvidia libraries?

My current workaround is to run the symlinking script below, which I've verified fixes the issue, in case anyone else needs a quick unblock. Unfortunately, the paths are hard-coded for my one particular py_binary, toolchain, and pip repo.

import os
from pathlib import Path
import glob

# Define your base directory
base_dir = Path("bazel-bin/python/cross_cloud_predictor/view.runfiles/rules_python~~pip~pip_311_torch/site-packages")

# Use globbing to find all directories matching the pattern
nvidia_dirs = glob.glob("external/rules_python~~pip~pip_311_nvidia*/site-packages/")


# Function to create symlinks at the file level
def create_symlinks(source_dir, target_dir):
    for root, _, files in os.walk(source_dir):
        for file in files:
            source_file = Path(root) / file
            relative_path = source_file.relative_to(source_dir)
            target_file = target_dir / relative_path

            source_file = source_file.absolute()
            target_file = target_file.absolute()

            # Ensure the target directory exists
            target_file.parent.mkdir(parents=True, exist_ok=True)

            # If the file already exists as a symlink, remove it
            if target_file.is_symlink() or target_file.exists():
                target_file.unlink()

            # Create the symlink
            target_file.symlink_to(source_file)

# Symlink files from each nvidia directory into the torch site-packages directory
for nvidia_dir in nvidia_dirs:
    create_symlinks(Path(nvidia_dir), base_dir)

Your team has probably thought about this, but maybe a similar script in Starlark that merges all dependencies into one giant virtual site-packages directory could work.
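For reference, the merge idea can be sketched as a generalization of my script above, with the hard-coded paths replaced by parameters (a sketch only; a real Starlark implementation would build these symlinks as Bazel actions, not at runtime):

```python
from pathlib import Path

def merge_site_packages(source_dirs, merged_dir):
    """Symlink every file from each source site-packages tree into one merged
    directory, approximating a standard flat site-packages layout in which
    $ORIGIN-relative RPATH entries resolve again."""
    merged = Path(merged_dir)
    for source in map(Path, source_dirs):
        for src in source.rglob("*"):
            if src.is_dir():
                continue
            dest = merged / src.relative_to(source)
            dest.parent.mkdir(parents=True, exist_ok=True)
            # Replace any stale link (also catches broken symlinks).
            if dest.is_symlink() or dest.exists():
                dest.unlink()
            dest.symlink_to(src.resolve())
```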

@groodt
Collaborator

groodt commented Aug 17, 2024

Here's a public example of a patch that uses symlinks: pytorch/pytorch#117350 (comment)

Here's a public example of a patch that preloads the dynamically linked libraries: pytorch/pytorch#101314 (comment)

The patch I'm carrying at $dayjob is very similar to the second example, just slightly more complicated.

Preloading means the dynamic linker won't need to look up the named libraries again in the process.

> script in Starlark that merges all dependencies into a giant virtual site-packages directory could work.

This is similar to one of our ideas. There are pros and cons and performance considerations to work out.

I really do want to fix the site-packages issue. It comes up often and creates a lot of support requests. It feels like one of the last major missing pieces in the Python rules.

@axbycc-mark
Author

Thanks so much, these links are very helpful. Meanwhile, I've also figured out how to get the pybind11_bazel pybind_extension working with Aspect's rules_py: you just have to include the pybind_extension target name in the py_binary/py_library's srcs, not deps.
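For anyone else hitting this, the wiring looks roughly like the following (target and file names are illustrative, not from a real build):

```starlark
load("@pybind11_bazel//:build_defs.bzl", "pybind_extension")
load("@aspect_rules_py//py:defs.bzl", "py_binary")

pybind_extension(
    name = "my_ext",  # produces my_ext.so, importable as `import my_ext`
    srcs = ["my_ext.cc"],
)

py_binary(
    name = "app",
    srcs = [
        "app.py",
        ":my_ext",  # the extension goes in srcs, not deps
    ],
)
```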

@groodt
Collaborator

groodt commented Aug 24, 2024

Closing as duplicate of tracking issue #2156

@groodt groodt closed this as completed Aug 24, 2024