mpirun -np 4 python mpi_ops_test.py failed #9

Open
abhishekcs10 opened this issue Jun 9, 2017 · 2 comments

NOTE: Only file GitHub issues for bugs and feature requests. All other topics will be closed.

For general support from the community, see StackOverflow.
To make bugs and feature requests more easy to find and organize, we close issues that are deemed
out of scope for GitHub Issues and point people to StackOverflow.

For bugs or installation issues, please provide the following information.
The more information you provide, the more easily we will be able to offer
help and advice.

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

No threads available

Environment info

Operating System: Ubuntu 16.04.1 LTS

Installed version of CUDA and cuDNN: CUDA-8.0 and cuDNN - 5.1.10
(please attach the output of ls -l /path/to/cuda/lib/libcud*):

-rw-r--r-- 1 root root 559800 Jan 26 17:10 /usr/local/cuda-8.0/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Jan 26 17:13 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root root 19 Jan 26 17:13 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61
-rw-r--r-- 1 root root 476024 Jan 26 17:10 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
-rw-r--r-- 1 root root 966166 Jan 26 17:10 /usr/local/cuda-8.0/lib64/libcudart_static.a

If installed from binary pip package, provide:

  1. A link to the pip package you installed:
  2. The output from python -c "import tensorflow; print(tensorflow.__version__)".
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so.5.1.10 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
    0.12.1

If installed from source, provide

  1. The commit hash (git rev-parse HEAD)
  2. The output of bazel version

Build label: 0.4.3-2017-01-24 (@6fc5c53)
Build target: bazel-out/local-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jan 24 20:34:16 2017 (1485290056)
Build timestamp: 1485290056
Build timestamp as int: 1485290056

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

Running mpi_ops_test.py under mpirun fails with the following error:

mpirun -np 4 python mpi_ops_test.py

FailedPreconditionError (see above for traceback): MPI has not been initialized; use tf.contrib.mpi.Session.
[[Node: MPISize = MPISize[_device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

ERROR: test_mpi_allreduce_error (__main__.MPITests)
Test that the allreduce raises an error if different ranks try to

Traceback (most recent call last):
File "mpi_ops_test.py", line 162, in test_mpi_allreduce_error
rank = session.run(mpi.rank())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 766, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 964, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1014, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1034, in _do_call
raise type(e)(node_def, op, message)
FailedPreconditionError: MPI has not been initialized; use tf.contrib.mpi.Session.
[[Node: MPIRank = MPIRank[_device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op u'MPIRank', defined at:
File "mpi_ops_test.py", line 301, in <module>
tf.test.main()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/test.py", line 91, in main
return _googletest.main()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/googletest.py", line 84, in main
benchmark.benchmarks_main(true_main=g_main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/benchmark.py", line 323, in benchmarks_main
true_main()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/googletest.py", line 58, in g_main
return unittest_main(*args, **kwargs)
File "/usr/lib/python2.7/unittest/main.py", line 95, in __init__
self.runTests()
File "/usr/lib/python2.7/unittest/main.py", line 232, in runTests
self.result = testRunner.run(self.test)
File "/usr/lib/python2.7/unittest/runner.py", line 151, in run
test(result)
return self.run(*args, **kwds)
File "/usr/lib/python2.7/unittest/suite.py", line 108, in run
test(result)
File "/usr/lib/python2.7/unittest/case.py", line 393, in __call__
return self.run(*args, **kwds)
File "/usr/lib/python2.7/unittest/case.py", line 329, in run
testMethod()
File "mpi_ops_test.py", line 81, in test_mpi_size
size = session.run(mpi.size())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/mpi/mpi_ops.py", line 68, in size
return MPI_LIB.mpi_size(name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 759, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2240, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1128, in __init__
self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): MPI has not been initialized; use tf.contrib.mpi.Session.
[[Node: MPISize = MPISize[_device="/job:localhost/replica:0/task:0/cpu:0"]()]]
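
A possible first diagnostic, offered as a sketch rather than a confirmed fix: the error says the MPI ops ran in a session where MPI was not initialized, so it may be worth ruling out the launcher before looking at the session. The snippet below checks whether a Python process was started by an MPI launcher by reading the environment variables that Open MPI (OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE) and MPICH-style launchers (PMI_RANK, PMI_SIZE) export. The helper name mpi_launch_info is hypothetical and not part of TensorFlow, and the check assumes one of those two launcher families. It does not touch tf.contrib.mpi.Session at all.

```python
import os

# (rank variable, size variable) pairs exported by common MPI launchers.
# Open MPI sets OMPI_COMM_WORLD_*; MPICH-style (PMI) launchers set PMI_*.
# Assumption: other launchers may use different names and are not detected.
_LAUNCHER_VARS = [
    ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),
    ("PMI_RANK", "PMI_SIZE"),
]


def mpi_launch_info(environ=None):
    """Return (rank, size) if a known launcher's variables are set, else None.

    Hypothetical helper for illustration only.
    """
    env = os.environ if environ is None else environ
    for rank_var, size_var in _LAUNCHER_VARS:
        if rank_var in env and size_var in env:
            return int(env[rank_var]), int(env[size_var])
    return None


if __name__ == "__main__":
    info = mpi_launch_info()
    if info is None:
        print("not started by a recognized MPI launcher")
    else:
        print("rank %d of %d" % info)
```

Saved under a name of your choice and run with mpirun -np 4, each process should report its own rank if the launcher is working. If every process reports that no launcher was recognized, the problem is probably in the MPI setup rather than in the test script.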

abhishekcs10 commented Jun 9, 2017

Can somebody please explain where MPI allreduce is called when running allreduce-test.py?

@chengdianxuezi

Have you run the distributed MPI demo successfully?
