This recipe shows how to run the Horovod distributed training framework for PyTorch using Batch AI.
- The standard Horovod pytorch_mnist.py example is used;
- pytorch_mnist.py downloads its training data on its own during execution;
- The job runs in the standard PyTorch container batchaitraining/pytorch:0.4.0-cp36-cuda9-cudnn7;
- The Horovod framework is installed in the container by the job preparation command line. Note that you can instead build your own Docker image containing PyTorch and Horovod;
- The standard output of the job is stored on an Azure File Share.
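
For orientation, here is a minimal sketch of the Horovod calls a PyTorch training script typically makes before its training loop. It is not a copy of pytorch_mnist.py: the helper names `scaled_learning_rate` and `build_distributed_training` and the default hyperparameters are assumptions made for illustration. Scaling the learning rate by the number of workers is a common convention rather than a requirement.

```python
def scaled_learning_rate(base_lr, num_workers):
    """Linear scaling: multiply the single-worker rate by the worker count."""
    if num_workers < 1:
        raise ValueError("num_workers must be >= 1")
    return base_lr * num_workers


def build_distributed_training(model, base_lr=0.01, momentum=0.5):
    """Prepare a torch.nn.Module for Horovod training and return its optimizer.

    Hypothetical helper; the default values are illustrative only.
    """
    # Imported here so this file still loads on a machine without
    # PyTorch or Horovod installed.
    import torch
    import torch.optim as optim
    import horovod.torch as hvd

    # Start Horovod; every worker process must call this once.
    hvd.init()

    # Pin each process to one GPU, chosen by its rank on the local node.
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())
        model.cuda()

    optimizer = optim.SGD(
        model.parameters(),
        lr=scaled_learning_rate(base_lr, hvd.size()),
        momentum=momentum,
    )

    # Average gradients across all workers on every optimizer step.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters()
    )

    # Start every worker from the same weights as rank 0.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    return optimizer
```

A script would call `build_distributed_training(model)` once and then use the returned optimizer in an ordinary training loop, usually with a `torch.utils.data.distributed.DistributedSampler` so that each worker reads a different shard of the data.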
You can find a Jupyter Notebook for this recipe in Horovod-PyTorch.ipynb.
You can find Azure CLI 2.0 instructions for this recipe in cli-instructions.md.
Under construction...
If you have any problems or questions, you can reach the Batch AI team at [email protected] or you can create an issue on GitHub.
We also welcome your contributions of additional sample notebooks, scripts, or other examples of working with Batch AI.