Advantages:
- Low CPU usage
- Low latency
- No extra copy from/to system memory
- Easy to use
Requirements:
- UVA support (available since CUDA 4.0)
Dependencies:
- automake
- flex
sudo apt install automake flex
[color=#10d19a]Currently only openmpi-1.10 is supported by PyTorch's compile system. [name=Stone sky] [time=Thu, Apr 5, 2018 9:56 PM]
[color=#10d19a]openmpi-3.1.1 can compile with the PyTorch master branch. [name=Stone sky] [time=Wed, Jul 25, 2018 11:39 PM]
Download the source from the internet; you may refer to the up-to-date downloads page for the latest version.
wget https://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.3.tar.gz # TODO: change to URL of 1.10.7
tar xvf openmpi-1.10.7.tar.gz
The newest release at the time of writing is 3.0.1, but I failed to build it due to a build bug. Build from source with CUDA support:
cd openmpi-1.10.7
mkdir build && cd build
../configure --with-cuda --enable-mpi-thread-multiple # it's not tab completed by zsh
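The steps above stop at configure; a minimal sketch of the remaining build and install commands (assuming the default /usr/local prefix, which matches the library paths used below):
make -j"$(nproc)"   # build in parallel
sudo make install   # installs to /usr/local by default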
If your CUDA location is not /usr/local/cuda, or you want to compile against a non-default CUDA version, you may follow the official CUDA tutorial for customized build options; an example is sketched below.
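For example (a sketch; the CUDA path here is only an assumption, substitute your own installation directory):
../configure --with-cuda=/usr/local/cuda-9.0 --enable-mpi-thread-multiple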
- Build with system-wide MPI (older version)
PyTorch's build system looks for libmpi and libmpicxx at /usr/lib, while the default install path of Open MPI is /usr/local/lib. The general build process will therefore raise an "mpi not found" error. You can either copy/link the .so libraries (the copy workaround and a symlink sketch follow below) or specify extra linking flags to compile successfully.
Workaround for PyTorch:
sudo cp /usr/local/lib/libmpi* /usr/lib
# compile pytorch
python setup.py clean
python setup.py build develop
# (optional, delete the redundant files)
sudo rm /usr/lib/libmpi*
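A sketch of the link alternative mentioned above (assuming the default /usr/local/lib install prefix; exact library file names vary across Open MPI versions, so check ls /usr/local/lib/libmpi* first):
sudo ln -s /usr/local/lib/libmpi.so /usr/lib/libmpi.so
sudo ln -s /usr/local/lib/libmpi_cxx.so /usr/lib/libmpi_cxx.so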
Then add the install directory to LD_LIBRARY_PATH in case of a "file not found" error at run time.
export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
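To make this persist across shells (a convenience step, not in the original notes), append the export to your shell profile:
echo 'export LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"' >> ~/.bashrc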
- Build with an arbitrary version of MPI
PyTorch uses the FindMPI module bundled with CMake. In recent CMake versions, it can automatically detect MPI's library and include paths if an MPI-compatible compiler is specified.
e.g.
python setup.py clean
CMAKE_C_COMPILER=$(which mpicc) CMAKE_CXX_COMPILER=$(which mpicxx) python setup.py build develop
- Simply list the dynamic libraries linked into PyTorch's run time:
ldd torch/*.so
If compiled with MPI, you can find libmpi.so. If compiled with CUDA-aware MPI, you can find libopen-rte.so.
- Run the test code:
import torch
import torch.distributed as dist

dist.init_process_group(backend='mpi')
# each rank fills a CUDA tensor with its own rank id
t = torch.zeros(5, 5).fill_(dist.get_rank()).cuda()
# sum across all ranks; with CUDA-aware MPI this runs directly on GPU memory
dist.all_reduce(t)
print(t)  # every rank should print a tensor filled with sum(range(world_size))
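To run it (a sketch; the script name and process count are placeholders), launch with mpirun so the MPI backend can pick up the rank and world size:
mpirun -np 2 python test_mpi.py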