
tensorflow v2.18.0 (redux) #414

Merged
merged 38 commits into from
Feb 9, 2025
Conversation

h-vetinari
Member

Rebased from #412

regro-cf-autotick-bot and others added 28 commits October 25, 2024 04:08
jaxlib may need the same fix.
@hmaarrfk
Contributor

hmaarrfk commented Feb 2, 2025

GitHub is kinda terrible about this, but I'm thinking that something is wrong with the hermetic CUDA stuff:
#414 (comment)

@hmaarrfk
Contributor

hmaarrfk commented Feb 2, 2025

It might be that we want to do:

# https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/tf_sig_build_dockerfiles/setup.cuda.sh#L28
ln -sf ${BUILD_PREFIX}/targets/x86_64-linux/lib/stubs/libcuda.so ${PREFIX}/lib/libcuda.so.1

instead, similar to the source I refer to.

@hmaarrfk
Contributor

hmaarrfk commented Feb 3, 2025

I'm still not happy that I can't import cuda tensorflow on a system that has no CUDA capable GPU.

On tensorflow 2.17, the following is blank:

ldd ${CONDA_PREFIX}/lib/python3.1/site-packages/tensorflow/python/platform/_pywrap_cpu_feature_guard.so  | grep libcuda

I'm going to be trying with the following patch:

diff --git a/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl b/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
index 528a1db7..6b280aa9 100644
--- a/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
+++ b/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
@@ -8,11 +8,6 @@ cc_import(
     shared_library = "lib/libcuda.so.%{libcuda_version}",
 )

-cc_import(
-    name = "libcuda_so_1",
-    shared_library = "lib/libcuda.so.1",
-)
-
 # TODO(ybaturina): remove workaround when min CUDNN version in JAX is updated to
 # 9.3.0.
 # Workaround for adding path of driver library symlink to RPATH of cc_binaries.
@@ -45,7 +40,6 @@ cc_library(
     %{comment}deps = [
         %{comment}":libcuda_so",
         %{comment}":fake_libcuda",
-        %{comment}":libcuda_so_1",
         %{comment}":driver_shared_library",
     %{comment}],
     visibility = ["//visibility:public"],
diff --git a/third_party/xla/third_party/tsl/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl b/third_party/xla/third_party/tsl/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
index 528a1db7..6b280aa9 100644
--- a/third_party/xla/third_party/tsl/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
+++ b/third_party/xla/third_party/tsl/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
@@ -8,11 +8,6 @@ cc_import(
     shared_library = "lib/libcuda.so.%{libcuda_version}",
 )

-cc_import(
-    name = "libcuda_so_1",
-    shared_library = "lib/libcuda.so.1",
-)
-
 # TODO(ybaturina): remove workaround when min CUDNN version in JAX is updated to
 # 9.3.0.
 # Workaround for adding path of driver library symlink to RPATH of cc_binaries.
@@ -45,7 +40,6 @@ cc_library(
     %{comment}deps = [
         %{comment}":libcuda_so",
         %{comment}":fake_libcuda",
-        %{comment}":libcuda_so_1",
         %{comment}":driver_shared_library",
     %{comment}],
     visibility = ["//visibility:public"],

locally to see what happens.

Update: without copying libcuda.so.1, compilation still fails. But when you do copy it and then inspect the resulting .so file:

objdump -T _pywrap_cpu_feature_guard.so

there is no mention of the need for libcuda …
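As an aside, a more direct way to check for a hard runtime dependency on the driver is to look at the DT_NEEDED entries rather than the dynamic symbol table. A minimal sketch with a stand-in library (since the real _pywrap_cpu_feature_guard.so path varies per install):

```shell
# Build a trivial shared object as a stand-in for the real extension module.
cat > demo.c <<'EOF'
int answer(void) { return 42; }
EOF
cc -shared -fPIC -o libdemo.so demo.c

# objdump -T lists exported dynamic symbols; the actual load-time library
# dependencies (DT_NEEDED) are reported by readelf -d (or objdump -p) instead.
if readelf -d libdemo.so | grep -q 'NEEDED.*libcuda'; then
    echo "hard dependency on libcuda"
else
    echo "no libcuda dependency"
fi
```

Running the same readelf check against the real extension module would show whether libcuda.so.1 ended up as a hard dependency.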

@hmaarrfk
Contributor

hmaarrfk commented Feb 3, 2025

@mgorny, @isuruf I don’t know if you have time to take a look at the overlinking to libcuda.so.1 that I believe is happening in some parts of the tensorflow library.

I don’t think we should need to depend on cuda-compat at build time, and the libraries that are failing to import should be importable even without a GPU.

I’ve tried. I just don’t know what to do at this point.

@ehfd
Member

ehfd commented Feb 4, 2025

I suggest reading #417, as it has major implications for the CUDA dependency. Please help prove otherwise on an older driver version if this issue (somehow) doesn't exist in conda-forge.

Not an issue after testing.

@hmaarrfk
Contributor

hmaarrfk commented Feb 4, 2025

Please help prove otherwise on an older driver version if this issue (somehow) doesn't exist in conda-forge.

I think the best way to help is to build this package locally with the existing versions, and test it on real hardware. If it fails, we could add new cuda builds, again, you can build and test locally and report your findings.

@ehfd
Member

ehfd commented Feb 4, 2025

Would it be possible to build this branch right now in the current state?

@hmaarrfk
Contributor

hmaarrfk commented Feb 4, 2025

Would it be possible to build this branch right now in the current state?

Generally there is consensus. However, the libcuda.so.1 issue makes me worried that we are overlinking everywhere, which makes me hesitant.

@ehfd
Member

ehfd commented Feb 5, 2025

@hmaarrfk I am happy to announce #417 is not a concern, and the conda-forge build is resilient to this issue unlike pip.

Just one thing I needed was os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=" + os.environ["CONDA_PREFIX"].
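For anyone hitting the same lookup failure, the equivalent workaround from the shell is a sketch like the following (it assumes a conda environment is active so that CONDA_PREFIX is set):

```shell
# XLA resolves CUDA data files (e.g. libdevice) via this flag; pointing it at
# the active conda prefix lets the runtime find them. It must be exported
# before Python imports TensorFlow to take effect.
export XLA_FLAGS="--xla_gpu_cuda_data_dir=${CONDA_PREFIX}"
echo "${XLA_FLAGS}"
```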

However, I did have corruption issues while using keras_hub.layers.TransformerEncoder, installed with pip after installing TensorFlow and Keras with conda. This merely justifies the need to build a keras-hub-feedstock (conda-forge/staged-recipes#28970) and a tensorflow-text-feedstock (conda-forge/staged-recipes#29008, mandatory dependency to keras-hub).

@vyasr

vyasr commented Feb 5, 2025

  File "$SP_DIR/tensorflow/python/platform/self_check.py", line 63, in preload_check
    from tensorflow.python.platform import _pywrap_cpu_feature_guard
ImportError: libnvrtc-builtins.so.12.6: cannot open shared object file: No such file or directory

Yeah, that .so.12.6 indicates that tensorflow is coupled so tightly to the CUDA version used to build it that we don't actually have forward compatibility. Not sure if this is expected @conda-forge/cuda?

Looking at the content of cuda-nvrtc, it's apparent that the builtins don't have major-level compatibility:

lib/libnvrtc-builtins.so.12.6
lib/libnvrtc-builtins.so.12.6.85
lib/libnvrtc.so.12
lib/libnvrtc.so.12.6.85
targets/x86_64-linux/lib/libnvrtc-builtins.so.12.6
targets/x86_64-linux/lib/libnvrtc-builtins.so.12.6.85
targets/x86_64-linux/lib/libnvrtc.so.12
targets/x86_64-linux/lib/libnvrtc.so.12.6.85

Perhaps it's unexpected that tensorflow links to libnvrtc-builtins.so directly? 🤔

In any case, it seems we should add a run-constraint on cuda-version {{ cuda_compiler_version }} for now.
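In recipe terms, such a constraint would look roughly like the following (an illustrative sketch; the exact jinja pinning used by the feedstock may differ):

```yaml
# meta.yaml fragment (illustrative): pin the runtime CUDA version to the
# build-time one, since libnvrtc-builtins offers no major-version
# compatibility guarantees.
requirements:
  run_constrained:
    - cuda-version {{ cuda_compiler_version }}
```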

The NVRTC-builtins library is indeed only intended for internal use and should not be linked to directly (discussed here in the docs). That library does not offer the same major version compatibility guarantees as the CUDA runtime does.

Regarding the tensorflow linkage, I don't know Tensorflow's build system at all but I know it's using Bazel as the primary driver. I'm not sure how linkage to CUDA is handled. If there is some part of the Bazel build system that is invoking a CMake build you might be running afoul of something like https://discourse.cmake.org/t/cmake-incorrectly-links-to-nvrtc-builtins/12723, which was fixed in https://gitlab.kitware.com/cmake/cmake/-/merge_requests/9890. If it's not using CMake but instead handling CUDA linkage directly in Bazel perhaps there is an analogous build rule that needs to be fixed to handle compatibility correctly.

@h-vetinari
Member Author

Thanks so much for the context @vyasr! I see that @hmaarrfk has already raised tensorflow/tensorflow#86413 - I guess we should backport that here?

@hmaarrfk
Contributor

hmaarrfk commented Feb 5, 2025

I guess we should backport that here?

It is backported. Though I still can’t find the line that tries to link to libcuda.so.1.

@vyasr

vyasr commented Feb 6, 2025

I guess we should backport that here?

It is backported. Though I still can’t find the line that tries to link to libcuda.so.1.

I haven't looked through the history of this thread to see why you need it (are you also trying to prevent linkage to the CUDA driver?), but it looks like those build rules are defined here. In particular they seem to be explicitly adding a dependency on libcuda.so.1.

@hmaarrfk
Contributor

hmaarrfk commented Feb 6, 2025

@vyasr I understand that libcuda is needed somewhere in the library, but for now it seems to be getting pulled in by some CPU-looking sections when it didn’t in 2.17.

#414 (comment)

see this specific comment

@vyasr

vyasr commented Feb 6, 2025

Got it. Yeah, I'm afraid I'd have to familiarize myself with a lot more Bazel to be of help here. It's not clear to me what the entrypoint even is to the hermetic GPU build paths. From a quick glance it seems like most of the BUILD files use tf_cuda_library, which is defined here, with the underlying command coming from xla, but I don't immediately see how that would lead to the cuda build rules defined in the top-level third_party directory.

@njzjz
Member

njzjz commented Feb 7, 2025

I'm going to be trying with the following patch:

diff --git a/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl b/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
index 528a1db7..6b280aa9 100644
--- a/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl
+++ b/third_party/gpus/cuda/hermetic/cuda_driver.BUILD.tpl

cuda_driver.BUILD.tpl may not be used, see https://github.com/tensorflow/tensorflow/blob/6550e4bd80223cdb8be6c3afd1f81e86a4d433c3/third_party/gpus/cuda/hermetic/cuda_cudart.BUILD.tpl#L31-L40

cc_library(
    name = "cudart",
    %{comment}deps = select({
        %{comment}"@cuda_driver//:forward_compatibility": ["@cuda_driver//:nvidia_driver"],
        %{comment}"//conditions:default": [":cuda_driver"],
    %{comment}}) + [
        %{comment}":cudart_shared_library",
    %{comment}],
    visibility = ["//visibility:public"],
)

The flag @cuda_driver//:forward_compatibility is false by default and is not enabled. So it may fall back to cuda_stub in cuda_cudart.BUILD.tpl:

cc_import(
    name = "cuda_stub",
    interface_library = "lib/stubs/libcuda.so",
    system_provided = 1,
)

@h-vetinari h-vetinari merged commit 4411337 into conda-forge:main Feb 9, 2025
4 checks passed
@h-vetinari h-vetinari deleted the 2.18 branch February 9, 2025 10:53