This example demonstrates how to run the standard ChainerMN train_mnist.py script as a distributed training job using Batch AI with InfiniBand enabled.
- The standard Chainer sample script train_mnist.py is used.
- Chainer downloads the standard MNIST database on its own and distributes it across workers.
- The standard output of the job and the model will be stored on an Azure File Share.
- IntelMPI (non-CUDA-aware) will be used to launch ChainerMN jobs across nodes.
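
To picture how a dataset is distributed across workers, here is a minimal pure-Python sketch of the idea. It is not ChainerMN's implementation: `shard_indices` is a hypothetical helper, and the worker count of 4 is an assumption chosen for illustration. In a real ChainerMN job the split is performed over MPI rather than locally.

```python
def shard_indices(n_samples, n_workers):
    """Split range(n_samples) into n_workers contiguous, near-equal shards.

    Hypothetical helper for illustration only; not part of Chainer or ChainerMN.
    """
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    base, extra = divmod(n_samples, n_workers)
    shards = []
    start = 0
    for rank in range(n_workers):
        # The first `extra` workers take one additional sample each.
        size = base + (1 if rank < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


# The MNIST training set has 60,000 images; 4 workers is an assumed value.
for rank, shard in enumerate(shard_indices(60000, 4)):
    print(f"worker {rank}: {len(shard)} samples")
```

Each worker then trains on its own shard, and gradients are combined across workers at every step.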
You can find Jupyter Notebook for this recipe in Chainer-GPU-Distributed-Infiniband.ipynb.
You can find Azure CLI 2.0 instructions for this recipe in cli-instructions.md.
The Dockerfile for the Docker images used in this recipe can be found here. It is a modified version of the ChainerMN example Dockerfile, built on the IntelMPI library.
Under construction...
If you have any problems or questions, you can reach the Batch AI team at [email protected] or you can create an issue on GitHub.
We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.