[Ray Core]: Cannot find gpu on Jetson AGX Orin #46263

Closed
lfermoselle-bdai opened this issue Jun 25, 2024 · 2 comments · Fixed by #46821
Labels
core: Issues that should be addressed in Ray Core
docs: An issue or change related to documentation
good first issue: Great starter issue for someone just starting to contribute to Ray
P1: Issue that should be fixed within a few weeks

lfermoselle-bdai commented Jun 25, 2024

I installed Ray with pip install ray[all] on my NVIDIA Jetson AGX Orin platform (aarch64), but Ray fails to detect my GPU. I have set the environment variable CUDA_VISIBLE_DEVICES=0 and confirmed that torch can see the GPU. Both nvcc and nvidia-smi detect the GPU without trouble.

I also tried to build Ray from source following the instructions here, but the build failed with the following error:

    Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
    In file included from /usr/include/string.h:535,
                     from external/upb/upb/collections/array_internal.h:31,
                     from bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/com_google_protobuf/src/google/protobuf/descriptor.upb.h:12,
                     from external/upb/upbc/protoc-gen-upbdefs.cc:28:
    In function 'void* memcpy(void*, const void*, size_t)',
        inlined from 'void _upb_MiniTable_CopyFieldData(void*, const void*, const upb_MiniTableField*)' at external/upb/upb/message/accessors_internal.h:119:13,
        inlined from 'void _upb_Message_GetNonExtensionField(const upb_Message*, const upb_MiniTableField*, const void*, void*)' at external/upb/upb/message/accessors_internal.h:210:33,
        inlined from 'const upb_Array* upb_Message_GetArray(const upb_Message*, const upb_MiniTableField*)' at external/upb/upb/message/accessors.h:297:36:
    /usr/include/aarch64-linux-gnu/bits/string_fortified.h:29:33: error: 'void* __builtin___memcpy_chk(void*, const void*, long unsigned int, long unsigned int)' writing 16 bytes into a region of size 8 overflows the destination [-Werror=stringop-overflow=]
       29 |   return __builtin___memcpy_chk (__dest, __src, __len,
          |          ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
       30 |                                  __glibc_objsize0 (__dest));
          |                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~
    In function 'void* memcpy(void*, const void*, size_t)',
        inlined from 'void _upb_MiniTable_CopyFieldData(void*, const void*, const upb_MiniTableField*)' at external/upb/upb/message/accessors_internal.h:119:13,
        inlined from 'void _upb_Message_GetNonExtensionField(const upb_Message*, const upb_MiniTableField*, const void*, void*)' at external/upb/upb/message/accessors_internal.h:213:31,
        inlined from 'const upb_Array* upb_Message_GetArray(const upb_Message*, const upb_MiniTableField*)' at external/upb/upb/message/accessors.h:297:36:
    /usr/include/aarch64-linux-gnu/bits/string_fortified.h:29:33: error: 'void* __builtin___memcpy_chk(void*, const void*, long unsigned int, long unsigned int)' writing 16 bytes into a region of size 8 overflows the destination [-Werror=stringop-overflow=]
       29 |   return __builtin___memcpy_chk (__dest, __src, __len,
          |          ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
       30 |                                  __glibc_objsize0 (__dest));
          |                                  ~~~~~~~~~~~~~~~~~~~~~~~~~~
    cc1plus: all warnings being treated as errors
    ERROR: /home/bdai/.cache/bazel/_bazel_bdai/20e1744cc6512cc3edcc909824a24c9a/external/com_github_cncf_udpa/xds/type/v3/BUILD:5:18 Middleman _middlemen/@com_Ugithub_Ucncf_Uudpa_S_Sxds_Stype_Sv3_Cpkg.upbdefs-BazelCppSemantics_build_arch_aarch64-opt failed: (Exit 1): gcc failed: error executing command (from target @upb//upbc:protoc-gen-upbdefs)
      (cd /home/bdai/.cache/bazel/_bazel_bdai/20e1744cc6512cc3edcc909824a24c9a/sandbox/linux-sandbox/3705/execroot/com_github_ray_project_ray && \
      exec env - \
        LD_LIBRARY_PATH=/workspaces/bdai/_build/install/whisper_ros/lib:/workspaces/bdai/_build/install/whisper_msgs/lib:/workspaces/bdai/_build/install/semiautonomy_msgs/lib:/workspaces/bdai/_build/install/physqa_msgs/lib:/workspaces/bdai/_build/install/perception_interfaces_msgs/lib:/workspaces/bdai/_build/install/ouster_ros/lib:/workspaces/bdai/_build/install/ouster_sensor_msgs/lib:/workspaces/bdai/_build/install/object_graph_msgs/lib:/workspaces/bdai/_build/install/lio_sam/lib:/workspaces/bdai/_build/install/language_grasping_ros2/lib:/workspaces/bdai/_build/install/spot_driver/lib:/workspaces/bdai/_build/install/spot_utilities_msgs/lib:/workspaces/bdai/_build/install/spot_msgs/lib:/workspaces/bdai/_build/install/realsense2_camera/lib:/workspaces/bdai/_build/install/realsense2_camera_msgs/lib:/workspaces/bdai/_build/install/modelbridge_msgs/lib:/workspaces/bdai/_build/install/distance_map/lib:/workspaces/bdai/_build/install/boss_msgs/lib:/workspaces/bdai/_build/install/bdai_ros/lib:/workspaces/bdai/_build/install/bdai_msgs/lib:/workspaces/bdai/_build/install/audio_common_msgs/lib:/workspaces/bdai/_build/install/active_mapping_msgs/lib:/opt/ros/humble/opt/rviz_ogre_vendor/lib:/opt/ros/humble/lib/aarch64-linux-gnu:/opt/ros/humble/lib:/usr/local/cuda/compat:/usr/local/cuda/lib64: \
        PATH=/home/bdai/.cache/bazelisk/downloads/sha256/5afe973cadc036496cac66f1414ca9be36881423f576db363d83afc9084c0c2f/bin:/workspaces/bdai/_build/install/bdai_ament_clang_tidy/bin:/workspaces/bdai/scripts:/home/bdai/.local/bin:/opt/ros/humble/bin:/usr/local/cuda/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
        PWD=/proc/self/cwd \
       /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBU[51/1881]ion-sections -fdata-sections '-std=c++14' -MD -MF bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/upb/upbc/_objs/protoc-gen-upbdefs/protoc-gen-upbdefs.d '-frandom-seed=bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/upb/upbc/_objs/protoc-gen-upbdefs/protoc-gen-upbdefs.o' '-DBAZEL_CURRENT_REPOSITORY="upb"' -iquote external/upb -iquote bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/upb -iquote external/com_google_absl -iquote bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/com_google_absl -iquote external/utf8_range -iquote bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/utf8_range -iquote external/com_google_protobuf -iquote bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/com_google_protobuf -iquote external/bazel_tools -iquote bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/bazel_tools -Ibazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/com_google_protobuf/src -g0 -g0 '-std=c++17' -Wextra -Werror -Wno-unused-parameter -Wno-long-long -fno-canonical-system-headers -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' -c external/upb/upbc/protoc-gen-upbdefs.cc -o bazel-out/aarch64-opt-exec-2B5CBBC6/bin/external/upb/upbc/_objs/protoc-gen-upbdefs/protoc-gen-upbdefs.o)
    # Configuration: c0a9aa16e8f0a405e7c69c72fcdd62e09ec964e564e2b53f3809888cc7f32fb6
    # Execution platform: @local_config_platform//:host

    Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
    INFO: Elapsed time: 6.230s, Critical Path: 4.68s
    INFO: 106 processes: 13 internal, 93 linux-sandbox.
    FAILED: Build did NOT complete successfully
    Traceback (most recent call last):
      File "<string>", line 2, in <module>
      File "<pip-setuptools-caller>", line 34, in <module>
      File "/workspaces/bdai/ray/python/setup.py", line 755, in <module>
        setuptools.setup(
      File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 87, in setup
        return distutils.core.setup(**attrs)
      File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 185, in setup
        return run_commands(dist)
      File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
        dist.run_commands()
      File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 969, in run_commands
        self.run_command(cmd)
      File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 1208, in run_command
        super().run_command(command)
      File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/usr/local/lib/python3.10/dist-packages/setuptools/command/develop.py", line 34, in run
        self.install_for_development()
      File "/usr/local/lib/python3.10/dist-packages/setuptools/command/develop.py", line 114, in install_for_development
        self.run_command('build_ext')
      File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
        self.distribution.run_command(command)
      File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 1208, in run_command
        super().run_command(command)
      File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
        cmd_obj.run()
      File "/workspaces/bdai/ray/python/setup.py", line 743, in run
        return pip_run(self)
      File "/workspaces/bdai/ray/python/setup.py", line 647, in pip_run
        build(True, BUILD_JAVA, True)
      File "/workspaces/bdai/ray/python/setup.py", line 595, in build
        return bazel_invoke(
      File "/workspaces/bdai/ray/python/setup.py", line 374, in bazel_invoke
        result = invoker([cmd] + cmdline, *args, **kwargs)
      File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['bazel', 'build', '--verbose_failures', '--', '//:ray_pkg', '//cpp:ray_cpp_pkg']' returned non-zero exit status 1.
    error: subprocess-exited-with-error

Am I missing any setup steps needed for Ray to detect the GPU on my Jetson AGX Orin?

Versions / Dependencies

ray==2.30.0
ray-cpp==2.30.0

Reproduction script

On a Jetson AGX Orin platform, after installing Ray with pip install ray[all], run the following in ipython:

import ray
ray.init(num_gpus=1)
ray.get_gpu_ids()

which returns an empty list, [].

Issue Severity

High: It blocks me from completing my task.

lfermoselle-bdai added the bug and triage labels Jun 25, 2024
anyscalesam added the core label Jul 8, 2024
jjyao added the question and P1 labels and removed the bug and triage labels Jul 8, 2024

jjyao (Collaborator) commented Jul 8, 2024

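ray.get_gpu_ids() returns the IDs of the GPUs assigned to the current worker process, so you should only call it inside a task or actor that requests GPU resources. Called from the driver, it returns an empty list even when Ray has detected the GPU.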

jjyao added the docs label Jul 8, 2024

jjyao (Collaborator) commented Jul 8, 2024

We should update the docstring of get_gpu_ids() to make it clear that it shouldn't be called from the driver.

jjyao added the good first issue label and removed the question label Jul 8, 2024
petern48 added a commit to petern48/ray that referenced this issue Jul 26, 2024
petern48 added a commit to petern48/ray that referenced this issue Jul 29, 2024
petern48 added a commit to petern48/ray that referenced this issue Jul 29, 2024