mpirun -n 2 failure #12

Open
gangliao opened this issue Oct 25, 2017 · 0 comments

    def test_mpi_allreduce_cpu(self):
        """Test on CPU that the allreduce correctly sums 1D, 2D, 3D tensors."""
        with mpi.Session() as session:
            size = session.run(mpi.size())

            dtypes = [tf.int32, tf.float32]
            dims = [1, 2, 3]
            for dtype, dim in itertools.product(dtypes, dims):
                tf.set_random_seed(1234)
                tensor = tf.random_uniform([17] * dim, -100, 100,
                                           dtype=dtype)
                summed = mpi.allreduce(tensor, average=False)
                multiplied = tensor * size
                max_difference = tf.reduce_max(tf.abs(summed - multiplied))

                # Threshold for floating point equality depends on number of
                # ranks, since we're comparing against precise multiplication.
                if size <= 3:
                    threshold = 0
                elif size < 10:
                    threshold = 1e-4
                elif size < 15:
                    threshold = 5e-4
                else:
                    break

                diff = session.run(max_difference)
                self.assertTrue(diff <= threshold,
                                "mpi.allreduce produces incorrect results")
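
The rank-dependent tolerance in the test above can be read as a small lookup. A minimal sketch, assuming a hypothetical helper name `allreduce_threshold` (it is not part of the test file); it returns `None` for rank counts where the test stops checking:

```python
def allreduce_threshold(size):
    """Return the largest allowed |summed - tensor * size| for `size` ranks.

    Hypothetical helper that mirrors the if/elif ladder in the test.
    None means the test does not check results at this rank count.
    """
    if size <= 3:
        return 0
    if size < 10:
        return 1e-4
    if size < 15:
        return 5e-4
    return None
```

With `mpirun -n 2` this gives a threshold of 0, so the test expects the allreduce sum to match `tensor * size` exactly.
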
mpirun -n 1 python allgather.py

True
True
True
True
True
True
...
mpirun -n 2 python allgather.py

2017-10-25 13:13:34.376886: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 998.75M (1047265280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-10-25 13:13:34.379151: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 898.88M (942538752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-10-25 13:13:34.381419: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 808.99M (848284928 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-10-25 13:13:34.383677: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 728.09M (763456512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-10-25 13:13:34.385962: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 655.28M (687110912 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-10-25 13:13:34.388247: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 589.75M (618400000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
False
False
False
False
False
False
False
False
False
False
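
One possible cause of the `CUDA_ERROR_OUT_OF_MEMORY` lines is that both ranks land on the same GPU and each tries to reserve most of its memory; the log alone does not confirm this. A minimal sketch of a common workaround, assuming Open MPI (which exports `OMPI_COMM_WORLD_LOCAL_RANK` to each launched process) and a hypothetical helper name `pin_gpu_to_local_rank`. It would have to run before TensorFlow is imported:

```python
import os


def pin_gpu_to_local_rank(num_gpus, env=None):
    """Expose one GPU per MPI rank via CUDA_VISIBLE_DEVICES.

    Hypothetical helper, not part of the project in this issue.
    Assumes Open MPI, which sets OMPI_COMM_WORLD_LOCAL_RANK for
    every process it launches. Ranks wrap around when there are
    more ranks than GPUs on the node.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    device = local_rank % num_gpus
    env["CUDA_VISIBLE_DEVICES"] = str(device)
    return device
```

On a single-GPU machine this does not help, since both ranks still share device 0. There the TensorFlow 1.x options `gpu_options.allow_growth` or `gpu_options.per_process_gpu_memory_fraction` on `tf.ConfigProto` are the usual knobs, though whether `mpi.Session` accepts a config is not shown here.
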